Cross site scripting: removing meta-characters from user-supplied data in CGI scripts using C#, Java and ASP

Ran into some issues with cross site scripting attacks today. CERT® has an excellent article that shows exactly how you should be filtering input from forms. Specifically, it mentions that just filtering *certain* characters in user supplied input isn’t good enough. Developers should be doing the opposite and only explicitly allowing certain characters. Of the approach that strips out known-bad characters, the article says:

… this method, the programmer determines which characters should NOT be present in the user-supplied data and removes them. The problem with this approach is that it requires the programmer to predict all possible inputs that could possibly be misused. If the user uses input not predicted by the programmer, then there is the possibility that the script may be used in a manner not intended by the programmer.

They go on to show examples of proper usage in both C and Perl, but who uses C and Perl? ;) Here are the same examples in C#, Java and ASP.

In C#, you’ll make use of the Regex class, which lives in the System.Text.RegularExpressions namespace. I left out the using directives for succinctness here (you can download the entire class using the links at the end of this post), but you simply create a new Regex object, supplying the regular expression pattern you want to look for as an argument to the constructor. In this case, the regular expression is looking for any characters not A-Z, a-z, 0-9, the ‘@’ sign, a period, an apostrophe, a space, an underscore or a dash. If it finds any run of characters not in that list, then it replaces the run with an underscore.

public static String Filter(String userInput) {
  Regex re = new Regex("([^A-Za-z0-9@.' _-]+)");
  String filtered = re.Replace(userInput, "_");
  return filtered;
}

In Java it’s even easier. Java 1.4 has a regular expression package (which you can read about here) but you don’t even need to use it. The Java String class contains a couple methods that take a regular expression pattern as an argument. In this example I’m using the replaceAll(String regex, String replacement) method:

public static String Filter(String userInput) {
  String filtered = userInput.replaceAll("([^A-Za-z0-9@.' _-]+)", "_");
  return filtered;
}
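One thing worth knowing: String.replaceAll() compiles the pattern every time it is called. If the filter runs on every request, you may prefer to compile the pattern once and reuse it. Here is a minimal sketch of that variation; the class name PrecompiledInputFilter and the choice to turn null into an empty string are my own assumptions, not part of the downloadable example.

```java
import java.util.regex.Pattern;

final class PrecompiledInputFilter {
    // Same whitelist as the examples above: anything outside it gets replaced.
    private static final Pattern DISALLOWED =
            Pattern.compile("[^A-Za-z0-9@.' _-]+");

    private PrecompiledInputFilter() {}

    /** Replaces each run of disallowed characters with a single underscore. */
    static String filter(String userInput) {
        if (userInput == null) {
            return ""; // assumption: treat missing input as empty
        }
        return DISALLOWED.matcher(userInput).replaceAll("_");
    }
}
```

Calling PrecompiledInputFilter.filter("<script>alert('x')</script>") gives back _script_alert_'x'_script_, while an ordinary email address passes through untouched.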

Finally, in ASP (VBScript) you’d use the RegExp object in a function like this:

Function InputFilter(userInput)
  Dim newString, regEx
  Set regEx = New RegExp
  regEx.Pattern = "([^A-Za-z0-9@.' _-]+)"
  regEx.IgnoreCase = True
  regEx.Global = True
  newString = regEx.Replace(userInput, "_")
  Set regEx = nothing
  InputFilter = newString
End Function

I think the next logical step would be to write a Servlet filter for Java that analyzes the request scope and automatically filters user input for you, much like the automatic request validation that happens in ASP.NET.
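I haven’t written that filter, but the core of it would probably look something like the sketch below. It leaves out the servlet plumbing entirely and just filters a map shaped like the one ServletRequest.getParameterMap() returns (parameter name to an array of values). The class name ParameterSanitizer is hypothetical; a real filter would most likely hand the cleaned map to the application through an HttpServletRequestWrapper.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

final class ParameterSanitizer {
    // Same whitelist as the InputFilter examples above.
    private static final Pattern DISALLOWED =
            Pattern.compile("[^A-Za-z0-9@.' _-]+");

    private ParameterSanitizer() {}

    /**
     * Returns a filtered copy of a request-style parameter map.
     * The input map is left untouched.
     */
    static Map<String, String[]> sanitize(Map<String, String[]> params) {
        Map<String, String[]> clean = new LinkedHashMap<>();
        for (Map.Entry<String, String[]> entry : params.entrySet()) {
            String[] values = entry.getValue();
            String[] filtered = new String[values.length];
            for (int i = 0; i < values.length; i++) {
                // Replace each run of disallowed characters with "_"
                filtered[i] = DISALLOWED.matcher(values[i]).replaceAll("_");
            }
            clean.put(entry.getKey(), filtered);
        }
        return clean;
    }
}
```

Returning a copy rather than modifying the map in place is deliberate: the map a servlet container hands you is normally read-only.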

You can download the full code for each of the above examples here:

· InputFilter.cs
· InputFilter.java
· InputFilter.asp

Feel free to comment on the way that you do cross site scripting filtering.

Lightweight Languages Workshop at MIT

Fun stuff going on at MIT in a couple days:

LL3 will be an intense, exciting, one-day forum bringing together the best programming language implementors and researchers, from both academia and industry, to exchange ideas and information, to challenge one another, and to learn from one another.

The workshop series focuses on programming languages, tools, and processes that are usable and useful. Lightweight languages have been an effective vehicle for introducing new features to mainstream programmers.

More information here.

The intricacies of HTTP

I’ve been working on a small piece of C# software this week that posts data to an HTTP server (which handles credit card processing), parses the results and then returns the results to a C# client. Pretty easy to do, right? First you create an HttpWebRequest object:

String url = "http://server/path";
HttpWebRequest req = (HttpWebRequest)WebRequest.Create(url);

and then you post the data:

byte[] requestBytes = System.Text.Encoding.ASCII.GetBytes (some_data);
req.Method = "POST";
req.ContentType = "application/x-www-form-urlencoded";
req.ContentLength = requestBytes.Length;
Stream requestStream = req.GetRequestStream();
requestStream.Write(requestBytes,0,requestBytes.Length);
requestStream.Close();

Finally, you retrieve the HTML returned from the server:

// note: exception handling removed for easier reading
HttpWebResponse res = (HttpWebResponse)req.GetResponse();
StreamReader sr = new StreamReader(res.GetResponseStream(), System.Text.Encoding.ASCII);
String line = sr.ReadToEnd();
sr.Close();

The reason that I was working on it was that the application was returning random exceptions of the form:

Error reading response stream: System.Net.WebException: The underlying connection was closed: An unexpected error occurred on a receive.

Googling for this error message didn’t leave me with much. There was a smattering of posts on various web forums about the error, but not a whole lot of solutions. Long story short, I fired up TcpTrace and, on a whim, set the KeepAlive property of the HttpWebRequest object to false, and voila! The application worked again. Best I can tell, the HTTP server I’m working against doesn’t handle HTTP posts using Connection: Keep-Alive properly. For whatever reason it decides that the third request in a Keep-Alive connection should be closed.
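For the Java folks, the rough equivalent with HttpURLConnection is to ask for the connection to be closed by sending a Connection: close request header. This is only a sketch under that assumption: the class name NoKeepAlivePost is made up, the method prepares the request without sending anything, and I haven’t tried it against the server described above.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;

final class NoKeepAlivePost {
    private NoKeepAlivePost() {}

    /**
     * Prepares (but does not send) a form POST that asks the server to
     * close the connection after the response instead of keeping it alive.
     */
    static HttpURLConnection prepare(String url) {
        try {
            HttpURLConnection conn =
                    (HttpURLConnection) URI.create(url).toURL().openConnection();
            conn.setRequestMethod("POST");
            conn.setDoOutput(true); // we intend to write a request body
            conn.setRequestProperty("Content-Type",
                    "application/x-www-form-urlencoded");
            conn.setRequestProperty("Connection", "close"); // opt out of Keep-Alive
            return conn;
        } catch (IOException e) {
            // Bad URL or unsupported method; rethrow unchecked to keep callers simple
            throw new UncheckedIOException(e);
        }
    }
}
```

From there you would write the form data to conn.getOutputStream() and read the reply from conn.getInputStream(), just like the C# version does with its request and response streams.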

Broadly, the reason I bring this up is because I think it’s important for all web developers to have an in-depth understanding of what’s going on under the hood of HTTP. Knowing the advantages and disadvantages of things like the HTTP Keep-Alive header becomes invaluable whenever you have to drop down to manually sending and receiving HTTP.

More pointedly, it was interesting to find out a couple tidbits about how .NET handles HTTP connections. First, by default .NET is configured (via machine.config) to use whatever proxy settings you have for Internet Explorer. You can turn this off by modifying the:

 configuration/system.net/defaultProxy/proxy

element. Second, also by default, machine.config only allows .NET applications to make 2 persistent connections to external resources. You can modify/view this as well:

 configuration/system.net/connectionManagement

Finally, the HttpWebRequest and its parent WebRequest are, again by default, set to use Keep-Alive connections.
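For reference, the two machine.config elements mentioned above look roughly like the fragment below. Treat it as a sketch of the relevant part of the file rather than a copy of any particular machine.config: the values shown are illustrative, with the proxy lookup switched off and the connection limit left at 2.

```xml
<configuration>
  <system.net>
    <defaultProxy>
      <!-- false = do not pick up the Internet Explorer proxy settings -->
      <proxy usesystemdefault="false" />
    </defaultProxy>
    <connectionManagement>
      <!-- maximum persistent connections per host; "*" applies to all hosts -->
      <add address="*" maxconnection="2" />
    </connectionManagement>
  </system.net>
</configuration>
```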

Logging in C#: enumerations, thread-safe StreamWriter

Joe gave me some great feedback on the C# logging utility I wrote about a couple months ago. Per his suggestions, I modified it in the following ways:

1) Instead of using public static int variables as levels, I added an enumeration:

enum Priority : int {
  OFF = 200,
  DEBUG = 100,
  INFO = 75,
  WARN = 50,
  ERROR = 25,
  FATAL = 0
}

An enumeration is a value type (ie: the enumeration is not a full fledged object) and thus is typically allocated on the stack, or inline in whatever contains it, rather than as a separate object on the heap. I’m guessing that Joe suggested the use of an enumeration for 2 reasons. First, an enumeration groups the constants together… in some sense it encapsulates what was a group of unrelated static integers into a single type, in this case named ‘Priority’. Second, because enumerations are value types, they generally require fewer resources from both the processor and memory of the machine on which the application is running.

2) Joe mentioned “… you probably need to put a lock{} around the calls to it (StreamWriter) – it’s not guaranteed to be threadsafe.” Turns out he’s right (not that it was a surprise). The StreamWriter documentation has this to say: “Any public static (Shared in Visual Basic) members of this type are thread safe. Any instance members are not guaranteed to be thread safe.” But the solution is easier than putting a lock{} on it. StreamWriter extends the TextWriter class, which itself has a static method for generating a thread safe wrapper. So where in the first version I had this:

StreamWriter sw = File.AppendText(filePath);

I now have this:

TextWriter tw = TextWriter.Synchronized(File.AppendText(filePath));

The File.AppendText() method returns a StreamWriter object, which the TextWriter.Synchronized() method wraps to create a thread-safe TextWriter which can be used just like a StreamWriter.

3) I noticed that the log4j implementation uses wrapper methods to make the argument list shorter. For instance, the Logger class has methods that look like this:

public void debug(Object message);
public void info(Object message);
public void warn(Object message);
public void error(Object message);
public void fatal(Object message);

I added the same idiom to my Logger class:

public static void Debug(String message) {
  Logger.Append(message, (int)Priority.DEBUG);
}

while still allowing for the more verbose:

public static void Append(String message, int level)
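For comparison, here is how the same two ideas (an enumeration for the levels, plus short wrapper methods around a verbose append) might look in Java. It’s a sketch, not a port of Logger.cs: the names LogPriority and SketchLogger are invented, the wrappers cover only two levels, and the "log file" is just an in-memory list so that the example has no I/O.

```java
import java.util.ArrayList;
import java.util.List;

/** Levels mirroring the C# Priority enum; numeric values copied from it. */
enum LogPriority {
    OFF(200), DEBUG(100), INFO(75), WARN(50), ERROR(25), FATAL(0);

    final int level;

    LogPriority(int level) {
        this.level = level;
    }
}

final class SketchLogger {
    // In-memory stand-in for the log file.
    private static final List<String> LINES = new ArrayList<>();

    private SketchLogger() {}

    /** The verbose form: callers name the priority explicitly. */
    static synchronized void append(String message, LogPriority priority) {
        LINES.add(priority + ": " + message);
    }

    // Convenience wrappers in the log4j style: shorter argument lists.
    static void debug(String message) { append(message, LogPriority.DEBUG); }
    static void error(String message) { append(message, LogPriority.ERROR); }

    /** Returns a snapshot of everything logged so far. */
    static synchronized List<String> lines() {
        return new ArrayList<>(LINES);
    }
}
```

Taking the enum itself as the parameter, rather than an int, means the compiler rejects a call like append("oops", 42); the C# signature above accepts any int.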

I uploaded the source and a test so you all can have a hack at it, if that kind of thing toots your horn:
· Logger.cs
· TestLogger.cs
I’m *always* open to comments and feedback. If you have even an inkling as to what I could do better with this code, *please* add your thoughts below.

Fail-Safe Amazon Image… using Java, C# & ColdFusion

Paul of onfocus.com fame (and the fabulous SnapGallery tool) wrote an article for the O’Reilly Network recently that (I think) was an excerpt of his recently released book “Amazon Hacks“. Anyway, he shows how you can check to see if an image exists on amazon.com using ASP, Perl, and PHP and I thought it would be fun to show how to do the same thing in Java, C# and ColdFusion. His examples were all functions of the form:

Function hasImage(imageUrl)

so I’m following that style. In Java you’d end up with something like this:

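// requires java.net.URL and java.net.HttpURLConnection
public static boolean hasImage(String url) {
  boolean result = false;
  try {
    URL iurl = new URL(url);
    HttpURLConnection uc = (HttpURLConnection)iurl.openConnection();
    uc.connect();
    // getContentType() can return null, so compare from the constant side
    if ("image/jpeg".equalsIgnoreCase(uc.getContentType())) {
      result = true;
    }
    uc.disconnect();
  } catch (Exception e) {
    // any failure (bad URL, network error) counts as "no image"
  }
  return result;
}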

In C#, almost the exact same thing:

public static Boolean HasImage(String url) {
  Boolean result = false;
  try {
    HttpWebRequest webreq = (HttpWebRequest)WebRequest.Create(url);
    WebResponse res = webreq.GetResponse();
    if (res.ContentType == "image/jpeg") {
      result = true;
    }
    res.Close();
  } catch {
  }
  return result;
}

and then in ColdFusion:

<cffunction name="hasImage" returntype="boolean" output="no">
  <cfargument name="imageUrl" type="string" required="yes">
  <cfhttp url="#imageURL#" method="GET">
  <cfif cfhttp.responseHeader["Content-Type"] EQ "image/jpeg">
    <cfreturn true>
  <cfelse>
    <cfreturn false>
  </cfif>
</cffunction>

The full source for all these examples is available:

· Amazon.java
· Amazon.cs
· amazon.cfm

Enjoy!

Custom string formatting in C#

Formatting strings for output into various mediums is always a fun… err.. required task. Every language does it differently. In C#, the numeric types overload the ToString() method to accept a format string, using this syntax:

Console.WriteLine(MyDouble.ToString("C"));

where “C” is a format specifier specifically for locale specific currency formatting. If the variable ‘MyDouble’ was 3456 in the example above, then under a US English locale you’d see:

$3,456.00

printed out. Of course, the fun doesn’t end there. There are a whole boatload of standard numeric formatting specifiers you can use, including decimal, number, percent and hexadecimal. But truly the most fun are the custom numeric format strings. Example: Let’s say that your boss wants you to format all product pricing rounded to the nearest dollar without using commas (ie: $1224 instead of $1,224.00). Normally, you’d write:

Price: <%= Product.Price.ToString("C") %>

but since you don’t want to have commas, you can use a custom format string:

Price: <%= Product.Price.ToString("$#####") %>

which will produce this:

Price: $1224

How about phone numbers? Don’t they just suck to format? In ColdFusion, you’d have something like this:

(#left(str, 3)#) #mid(str, 4, 3)# - #right(str, 4)#

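where ‘str’ is a string containing the 10 digit phone number. In C#, provided the phone number is held in a numeric type such as a long (custom numeric format strings apply to numbers, not strings), you can write this: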

phone.ToString("(###) ### - ####");

Pretty concise isn’t it?
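Since I keep comparing languages, here is roughly how the same three jobs might be done in Java. This is a sketch with invented names (FormatSketch and its methods), pinned to Locale.US so the output doesn’t depend on the machine it runs on. Java has no numeric mask quite like the C# phone trick, so the phone number is kept as a string and sliced up, much like the ColdFusion version.

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.text.NumberFormat;
import java.util.Locale;

final class FormatSketch {
    private FormatSketch() {}

    /** Locale-aware currency, comparable to ToString("C") under a US culture. */
    static String currency(double amount) {
        return NumberFormat.getCurrencyInstance(Locale.US).format(amount);
    }

    /** Whole dollars with no grouping commas, comparable to ToString("$#####"). */
    static String wholeDollars(double amount) {
        // '$' is quoted so it is treated as a literal prefix
        DecimalFormat format =
                new DecimalFormat("'$'#", DecimalFormatSymbols.getInstance(Locale.US));
        return format.format(amount);
    }

    /** Formats a 10 digit phone number held as a string. */
    static String phone(String digits) {
        if (digits == null || !digits.matches("\\d{10}")) {
            throw new IllegalArgumentException("expected exactly 10 digits");
        }
        return String.format("(%s) %s - %s",
                digits.substring(0, 3), digits.substring(3, 6), digits.substring(6));
    }
}
```

With these, currency(3456) gives $3,456.00, wholeDollars(1224.00) gives $1224, and phone("6175551212") gives (617) 555 - 1212.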

Spidering Hacks

I fielded a couple questions this week about search engine safe URL’s, both of them along the lines of a) how do you create them? and b) are they even worth it? I’ve written about how you can create them using Apache before, but one of the things I didn’t mention was that I think writing your own spider.. or at least attempting to, is a great first step to understanding why search engine safe URL’s are important. To that end, I’d suggest the “Spidering Hacks” book that O’Reilly just released as a great starting point. The book uses Perl quite extensively, but it’s the process that matters. I’ve picked up “Programming Spiders, Bots, and Aggregators in Java” at Barnes and Noble quite a few times as well, but have never pulled the trigger.

If you’d rather read code, you can download the spider/indexing engine I’ve been working on (was working on!) to get some kind of idea of what goes into a spider.

Martin Fowler @ NEJUG: Software Design in the Twenty-first Century

I attended the NEJUG meeting in Lowell last week that Martin Fowler spoke at. I was the guy in the back furiously typing notes, which I’m presenting for your pleasure here, revised and polished.

Martin started out by saying that he didn’t know exactly what to talk about, and then he launched into a discussion about the completely new version of UML very near to completion, UML 2.0.

· newest version of uml has a lot of revisions to the metamodel,
— in light of people thinking that uml is all about diagrams and people not really caring all that much about diagrams
— 3 ways in which uml is commonly used: sketches, blueprints and languages

· sketches
— draw diagrams to communicate ideas about software
— not precise, just an overall picture
— main idea is to convey ‘ideas’
— most books that use UML are completely wrong in their use of uml (and almost all are sketches)
— martin’s favorite uml drawing tool? a printing whiteboard
— no whiteboard? then use visio (templates are available on martin’s webpage of links)
— togetherJ (an ide?) has built in UML support for sketching

· blueprints
— software development as an engineering process
— martin doesn’t agree w/ this metaphor
— one body of people who “design” and make all the important design decisions, hand off to another group for implementation
— reality is that no one cares if the software matches the diagram
— in real civil engineering the designers check to make sure that the end result matches the original design
— both parties need to be intimately familiar with the intricacies of the UML specification for blueprints to work

· uml as a programming language
— “executable UML”
— “model driven architecture”
— graphics matter less, the meta model takes precedence in this way of thinking

· the people that are driving the standard are heavily involved with both the blueprints and the uml as a programming language directions, which means that not many people are thinking about people that use uml as a sketching tool
— thus, almost all the changes in uml2 are for uml as a programming language
— these people think that in 10/20 years no one will be programming in java/c#, but rather in UML
— CASE people said the same thing in the 80’s

couple arguments that might make this possible
a) UML is a standard where case is a company (examples: SQL, Java..)
   1) however, it’s a pretty complex standard and not everything is agreed upon
   2) a lot of things are open to interpretation
   3) subsets of UML aren’t all made for executable UML
   4) diagrams can’t be transferred from one tool to another without a loss of data

b) malapropism for platform independence
— build a platform independent model (without regard to programming language)
— then do a build to a platform specific model that would then build the source code
— uml people think of platforms as “java” or “.net”

c) uml hasn’t yet approached the discussion of the libraries
— ie: it can’t write “hello world” yet

sequence diagrams were so easy that anyone could understand just by looking

interaction diagrams, which are new to 2.0, are much more complicated

mf doesn’t think that code generation will *really* be all that much more productive than regular programming in Java or C#

mf quotes an engineer who said: “engineering is different from programming in that engineering has tolerance while programming is absolute.”

uml’s success as a programming language hinges on its ability to make people more productive

mf thinks that there will be a growing divergence between the sketchers and the blueprinters/executable people

prescriptive vs. descriptive
— uml is increasingly becoming descriptive, not prescriptive

structural engineers use no “standard” drawing diagrams, but simply follow “accepted” rules… a trend that MF thinks that UML will probably follow: we’ll all use UML, but not necessarily according to the specification

questions that came from the audience