All posts by ajohnson

Shallow comparison of ASP.NET to Flex

Flex seems pretty interesting when you realize how similar it is to something like ASP.NET. Look how similar this snippet of Flex is:

<?xml version="1.0" encoding="iso-8859-1"?>
<mx:Application xmlns:mx="http://www.macromedia.com/2003/mxml">
   <mx:script>
   function copy() {
      destination.text=source.text
      }
   </mx:script>
   <mx:TextInput id="source" width="100"/>
   <mx:Button label="Copy" click="copy()"/>
   <mx:TextInput id="destination" width="100"/>
</mx:Application>

to this snippet of ASP.NET code:

<script language="C#" runat="server">
   public void Copy(Object src, EventArgs e) {
      destination.Text = source.Text;
   }
</script>
<form runat="server">
<asp:textbox id="source" size="30" runat="server" />
<asp:button id="copy" OnClick="Copy" text="Copy" runat="server" />
<asp:textbox id="destination" size="30" runat="server" />
</form>

I can’t wait to see the IDE rolled out for Eclipse and the .NET version. Cool stuff Macromedia!

QueryParser … in NLucene

Misleading title. I implemented the first of the examples that Erik Hatcher used in his article about the Lucene QueryParser, only I used NLucene. Lucene and NLucene are very similar, so if anything, it’s interesting only because it highlights a couple of the differences between C# and Java.

First, here’s the Java example taken directly from Erik’s article:

public static void search(File indexDir, String q) throws Exception {
  Directory fsDir = FSDirectory.getDirectory(indexDir, false);
  IndexSearcher is = new IndexSearcher(fsDir);
  Query query = QueryParser.parse(q, "contents", new StandardAnalyzer());
  Hits hits = is.search(query);
  System.out.println("Found " + hits.length() +
    " document(s) that matched query '" + q + "':");
  for (int i = 0; i < hits.length(); i++) {
    Document doc = hits.doc(i);
    // display the retrieved document
  }
}
The NLucene version looks eerily similar:

public static void Search(DirectoryInfo indexDir, string q) {
  DotnetPark.NLucene.Store.Directory fsDir = FsDirectory.GetDirectory(indexDir, false);
  IndexSearcher searcher = new IndexSearcher(fsDir);
  Query query = QueryParser.Parse(q, "contents", new StandardAnalyzer());
  Hits hits = searcher.Search(query);
  Console.WriteLine("Found " + hits.Length +
    " document(s) that matched query '" + q + "':");
  for (int i = 0; i < hits.Length; i++) {
    Document doc = hits[i].Document;
    // display the retrieved document
  }
}
The differences are mainly syntactic.

First, Erik used the variable name 'is' for his IndexSearcher. In C# 'is' is a keyword, so I switched the variable name to 'searcher'. If you're really geeky, you might want to brush up on all the Java keywords and the C# keywords.

Second, while Java uses the File class to describe directories and files, the .NET Framework uses the DirectoryInfo class.

Third, Java programmers are encouraged to capitalize class names and use camelCase for method and variable names, while C# programmers are encouraged to use Pascal casing for methods and camelCase for variables, so I switched the static method name from 'search' to 'Search'.

Next, 'Directory' is a system class, so the reference to the NLucene directory needed to be fully qualified:

DotnetPark.NLucene.Store.Directory fsDir = FsDirectory.GetDirectory(indexDir, false);

rather than this:

Directory fsDir = FsDirectory.GetDirectory(indexDir, false);

Finally, the Hits class contains a couple of differences. Java programmers use the length() method on a variety of classes, so it made sense for the Java version to use a length() method as well. C# introduced the idea of a property, which is nothing more than syntactic sweetness that allows the API developer to encapsulate the implementation of a variable while still allowing access to it as if it were a public field. The end result is that instead of writing:

for (int i = 0; i < hits.length(); i++)
in Java, you'd use this in C#:

for (int i = 0; i < hits.Length; i++)
The authors of NLucene also decided to use the C# indexer functionality (which I wrote about a couple of days ago) so that an instance of the Hits class can be accessed as if it were an array:

Document doc = hits[i].Document;

I put together a complete sample that you can download and compile yourself if you're interested in using NLucene. Download it here.

Lucene’s Query API

Erik Hatcher wrote an excellent article on the specifics of Lucene’s Query API, specifically on how the QueryParser class uses the Query subclasses including TermQuery, PhraseQuery, RangeQuery, WildcardQuery, PrefixQuery, FuzzyQuery and BooleanQuery. Very useful stuff.

Not surprisingly, he’s also writing a book on Lucene titled “Lucene in Action”, to be published by Manning.

2003 Lightweight Languages Workshop notes

Joe and I went to the Lightweight Languages Workshop at MIT this past Saturday. In short, a bunch of nerds got together for the entire day to talk about stuff like Haskell, Lisp, Lua, Scheme, and boundaries. If you’re at all interested, you can watch the webcasts (which are of surprisingly good quality) in Real Media or Windows Media here, and Joe has already written up his notes. I’m behind a bit, so here are mine, a couple of days late.

The initial session, “Toward a LL Testing Framework for Large Distributed Systems”, was especially interesting to me for a couple of reasons: a) it was based on technology deployed for DARPA called UltraLog, b) UltraLog is an “… ultra-survivable multi-agent society” and c) it uses Jabber, Python and Ruby. Specifically, they used those technologies to get an up-close and personal look at how their entire system (in this case thousands of agents) was performing. Said another way, they created a way to quickly and unobtrusively gather information from a variety of datapoints while the program is running. If you have a single website on one server, this problem doesn’t matter much to you. But imagine a system of 5000 servers (which is something I was asked to imagine this past Monday, more on that at a later time). An application running on 5000 servers would generate an unusable amount of information; simple logging statements won’t help you. I’m rambling though. The interesting takeaway from all this is the idea of creating instrumentation for your applications [google search for ‘code instrumentation’].
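To make the instrumentation idea concrete, here is a minimal sketch (all names are mine, purely illustrative, and nothing here comes from UltraLog): a hook that times a unit of work and records the measurement in a structure that could later be harvested over a channel like Jabber, rather than dumping raw log lines to disk.

```java
import java.util.*;

class Instrumentation {
    // collected measurements: operation name -> elapsed milliseconds
    private static final Map<String, Long> timings = new LinkedHashMap<>();

    // time a unit of work and record the result for later harvesting
    static void timed(String name, Runnable work) {
        long start = System.nanoTime();
        work.run();
        timings.put(name, (System.nanoTime() - start) / 1_000_000);
    }

    public static void main(String[] args) {
        timed("sort", () -> {
            int[] data = new Random(42).ints(100_000).toArray();
            Arrays.sort(data);
        });
        // in a 5000-server system this map would be reported to a collector
        System.out.println("timed operations: " + timings.keySet());
    }
}
```

The point is that the measurements live in a queryable structure instead of a log file, so an external collector can pull them while the program runs.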

URLs harvested from the other sessions:

· Web Authoring System Haskell (WASH)

· XS: Lisp on Lego MindStorms

· the idea of continuations, where:

def foo(x):
  return x+1

becomes:

def foo(x,c):
  c(x+1)

· dynamic proxies [googled] [javaworld.com] [onjava.com]

· The Great Computer Language Shootout

· lua: embeddable in C/C++, Java, Fortran, Ruby, OPL, C#…. runs in Palm OS, Brew, Playstation II, XBox and Symbian.

· c minus minus

· scheme
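The continuation bullet above is easy to make concrete. Here is a hedged sketch in Java (names are mine, purely illustrative): the direct-style function returns its result to the caller, while the continuation-passing version hands the result forward into a continuation instead.

```java
import java.util.function.IntConsumer;

class Continuations {
    // direct style: the result flows back to the caller
    static int foo(int x) { return x + 1; }

    // continuation-passing style: the result flows forward into c
    static void fooCps(int x, IntConsumer c) { c.accept(x + 1); }

    public static void main(String[] args) {
        System.out.println(foo(41));            // direct style
        fooCps(41, r -> System.out.println(r)); // CPS: same result
    }
}
```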

CFX_Lucene updates

Couple people have written me in the last couple days with updates they’ve done to the Lucene and ColdFusion tags I wrote a couple months ago.

First, Nick Burch from Torchbox updated the CFX tag so that it “… behaves better under error conditions and … the command line debug now works.” I also read here that they (Torchbox) are hoping to release an open source package written in Java “… to convert file types to plain text and a CF custom tag to interface to Lucene which will search them.”

Today, Scott piped in with a nice addition that adds the score to the query returned to the calling tag:

// Define column indexes
String[] columns = { "URL", "TITLE", "SUMMARY", "SCORE" } ;

// loop over all the results, add each to the query
for (int i = 0; i < hits.length(); i++) {
  // add the URL, TITLE, SUMMARY and SCORE of each hit to the query
}
For those of you like Scott who want to index PDF and Office documents, I'd suggest you start by taking a look at these jGuru FAQs:

Java Guru: How can I index PDF documents?

Java Guru: How can I index Word documents?

Cheers!

C# Indexers

At Mindseye we’ve written a content management system, which is really just the net result of writing and improving upon a modular code base for a bunch of different websites in a variety of programming languages (ASP.NET, ASP, ColdFusion, and Java). In this content management system, which we’ve affectionately called ‘Element’, a website is distilled down into various ‘objects’: things like events, newsletters, products, etc. Long story short, each object (in whatever language) usually has a specific type and is represented internally as an XML document. In the .NET/C# version that I’ve been working with lately, a newsletter would look vaguely like this:

public class Newsletter {
  private XmlDocument newsletter;
  public Newsletter(XmlDocument doc) {
   this.newsletter = doc;
  }
  // other methods left out
}

Putting aside your opinion on whether this is a good or bad design for a second, I’ve always struggled with how best to make the elements of the encapsulated XML document available to external classes. Right now I have a method:

public string GetProperty(string label) {
  // retrieve the appropriate element and return as string
  return theValue;
}

and this works pretty well, but it’s lengthy. Another way of doing it would be to make each element of the XmlDocument a public property, but this would require a lot of typing and would mean recompiling the class every time the data structure it represents changed. So tonight, during nerd time (i.e., extended time by myself at Barnes and Noble), I read about C# indexers. You’ve probably used an indexer before; for instance, the NameValueCollection class contains a public property called Item, which itself is an indexer for a specific entry in the NameValueCollection. Unbeknownst to me before tonight, you can create your own indexers, so instead of having to access an element of an object like this:

string newsletterLabel = newsletterInstance.GetProperty("label");

you could instead use this syntax:

string newsletterLabel = newsletterInstance["label"];

which just feels more natural to me. Implementing the indexer in your class is simple. Using the example ‘Newsletter’ class above:

public class Newsletter {
  private XmlDocument newsletter;
  public Newsletter(XmlDocument doc) {
   this.newsletter = doc;
  }
  // indexer
  public string this [string index] {
    get {
      // logic to retrieve appropriate element from xmldoc by name
      return theValue;
    }
  }
}

I’m guessing that generally the index will be an integer rather than the string that I have above; nothing much changes if the parameter is an integer:

public string this [int index] {
  get {
    // logic to retrieve appropriate element from xmldoc by index
    return theValue;
  }
}
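For comparison, Java has no indexer syntax; the closest analogue is an explicit accessor method. A minimal sketch (a Map stands in for the XML document here, purely for illustration, and all names are mine):

```java
import java.util.*;

class Newsletter {
    // a map stands in for the encapsulated XML document in this sketch
    private final Map<String, String> fields;

    Newsletter(Map<String, String> fields) {
        this.fields = fields;
    }

    // Java equivalent of the C# indexer: a plain accessor method
    String get(String label) {
        return fields.get(label);
    }

    public static void main(String[] args) {
        Map<String, String> f = new HashMap<>();
        f.put("label", "Monthly update");
        Newsletter n = new Newsletter(f);
        System.out.println(n.get("label"));
    }
}
```

So where C# lets you write `newsletterInstance["label"]`, Java callers are stuck with `newsletterInstance.get("label")`.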

Extremely handy stuff to know! Couple more thoughts on the subject:

· Indexers
· Comparison Between Properties and Indexers
· Properties
· Developer.com: Using Indexers

C# static constructors, destructors

Spent some more time on the C# logging class I’ve been working on. Per Joe’s suggestions, I modified the StreamWriter so that it is now a class variable and, as such, need not be initialized every time the class is used. Instead, you can use a static constructor (the idea exists in both C# and Java, although it goes by different names). In C#, you simply prefix a parameterless constructor with the keyword ‘static’:

public class Logger {
 static Logger() {
  // insert static resource initialization code here
 }
}

In Java it’s called a static initialization block (more here) and it looks like this:

public class Logger {
 static {
  // insert static resource initialization code here
 }
}

If you’d like a real life example of a Java static initialization block, check out the source for the LogManager class in the log4j package.

Anyway, now the Logger class declares a StreamWriter:

private static StreamWriter sw;

and then uses the static constructor to initialize it for writing to the text file:

static Logger() {
  // load the logging path
  // ...
  // if the file doesn't exist, create it
  // ...
  // open up the streamwriter for writing..
  sw = File.AppendText(logDirectory);
}

Then use the lock keyword when writing to the resource, to make sure that only one thread accesses the resource at a time:

 ...
 lock(sw) {
   sw.Write("\r\nLog Entry : ");
   ...
   sw.Flush();
 }
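For what it’s worth, the Java analogue of C#’s lock keyword is a synchronized block. A minimal sketch (a StringWriter stands in for the file-backed writer, purely for illustration, and the names are mine):

```java
import java.io.StringWriter;

class SyncDemo {
    private static final StringWriter sw = new StringWriter();

    static void log(String entry) {
        // like C#'s lock(sw): only one thread may write at a time
        synchronized (sw) {
            sw.write("Log Entry : " + entry + "\n");
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Thread[] threads = new Thread[3];
        for (int i = 0; i < threads.length; i++) {
            final int id = i;
            threads[i] = new Thread(() -> log("thread " + id));
            threads[i].start();
        }
        for (Thread t : threads) {
            t.join();
        }
        // each entry arrives intact, one per line, never interleaved
        System.out.println(sw.toString().split("\n").length);
    }
}
```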

Now all objects that call the various static methods will be using the same private StreamWriter. But you’re left with one more problem: the StreamWriter is never explicitly closed. If the StreamWriter were an instance variable, then we could solve this by implementing a destructor. The destructor would take this form:

~Logger() {
  try {
    sw.Close();
  } catch {
    // do nothing, exit..
  }
}

However, in this case the StreamWriter is a static/class variable; since no ‘instance’ of Logger ever exists in the system, the ~Logger destructor will never get called. Instead, when the StreamWriter becomes eligible for destruction, the garbage collector runs the StreamWriter’s Finalize method (which itself will presumably call the StreamWriter’s Close() method), automatically freeing up the resources the StreamWriter used.

I updated the Logger class and its personal testing assistant TestLogger (which has also been updated to use 3 threads). You can download them here:

· Logger.cs
· TestLogger.cs

Cross site scripting: removing meta-characters from user-supplied data in CGI scripts using C#, Java and ASP

Ran into some issues with cross site scripting attacks today. CERT® has an excellent article that shows exactly how you should be filtering input from forms. Specifically, it mentions that just filtering out *certain* characters in user-supplied input isn’t good enough. Developers should be doing the opposite and only explicitly allowing certain characters. Using

… this method, the programmer determines which characters should NOT be present in the user-supplied data and removes them. The problem with this approach is that it requires the programmer to predict all possible inputs that could possibly be misused. If the user uses input not predicted by the programmer, then there is the possibility that the script may be used in a manner not intended by the programmer.

They go on to show examples of proper usage in both C and Perl, but who uses C and Perl? 😉 Here are the same examples in C#, Java and ASP.

In C#, you’ll make use of the Regex class, which lives in the System.Text.RegularExpressions namespace. I left out the import statements for succinctness here (you can download the entire class using the links at the end of this post), but you simply create a new Regex object supplying the regular expression pattern you want to look for as an argument to the constructor. In this case, the regular expression is looking for any characters not A-Z, a-z, 0-9, the ‘@’ sign, a period, an apostrophe, a space, an underscore or a dash. If it finds any characters not in that list, then it replaces them with an underscore.

public static String Filter(String userInput) {
  Regex re = new Regex("([^A-Za-z0-9@.' _-]+)");
  String filtered = re.Replace(userInput, "_");
  return filtered;
}

In Java it’s even easier. Java 1.4 has a regular expression package (which you can read about here) but you don’t even need to use it. The Java String class contains a couple methods that take a regular expression pattern as an argument. In this example I’m using the replaceAll(String regex, String replacement) method:

public static String Filter(String userInput) {
  String filtered = userInput.replaceAll("([^A-Za-z0-9@.' _-]+)", "_");
  return filtered;
}

Finally, in ASP (VBScript) you’d use the RegExp object in a function like this:

Function InputFilter(userInput)
  Dim newString, regEx
  Set regEx = New RegExp
  regEx.Pattern = "([^A-Za-z0-9@.' _-]+)"
  regEx.IgnoreCase = True
  regEx.Global = True
  newString = regEx.Replace(userInput, "")
  Set regEx = nothing
  InputFilter = newString
End Function

I think the next logical step would be to write a Servlet filter for Java that analyzes the request scope and automatically filters user input for you, much like the automatic request validation that happens in ASP.NET.
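As a rough sketch of what such a filter’s core logic might look like (the servlet plumbing is omitted; a plain map stands in for the request’s parameters, and all names here are mine, not from any servlet API):

```java
import java.util.*;

class RequestFilterSketch {
    // the same allow-list replacement used in the examples above
    static String filter(String userInput) {
        return userInput.replaceAll("([^A-Za-z0-9@.' _-]+)", "_");
    }

    // what a servlet filter would do: clean every parameter value
    // before the request reaches the application
    static Map<String, String> filterParams(Map<String, String> params) {
        Map<String, String> clean = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : params.entrySet()) {
            clean.put(e.getKey(), filter(e.getValue()));
        }
        return clean;
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("name", "Aaron <script>alert(1)</script>");
        System.out.println(filterParams(params).get("name"));
    }
}
```

A real implementation would wrap the incoming request so that getParameter() returns the cleaned values, but the allow-list replacement is the heart of it.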

You can download the full code for each of the above examples here:

· InputFilter.cs
· InputFilter.java
· InputFilter.asp

Feel free to comment on the way that you do cross site scripting filtering.

Lightweight Languages Workshop at MIT

Fun stuff going on at MIT in a couple days:

LL3 will be an intense, exciting, one-day forum bringing together the best programming language implementors and researchers, from both academia and industry, to exchange ideas and information, to challenge one another, and to learn from one another.

The workshop series focuses on programming languages, tools, and processes that are usable and useful. Lightweight languages have been an effective vehicle for introducing new features to mainstream programmers.

More information here.