Category Archives: Software Development

java.lang.IllegalArgumentException: Illegal group reference, replaceAll and dollar signs

This weblog is officially about inane things I run into while trying to do my job at work. Let’s say you have a String object like this:

String mystring = "Your password: #PASSWORD";

and at runtime you need to replace the value of #PASSWORD with a password that a user typed in. You’d write something like this:

String password = "$Jslwe"
mystring = mystring.replaceAll("#PASSWORD", password);

What would happen? You’d expect that the key #PASSWORD would get replaced with the value of the variable ‘password’ (which is “$Jslwe”) and then you’d move happily on your way to something much more interesting. But no, Java throws you an error:

java.lang.IllegalArgumentException: Illegal group reference

which is extremely helpful. Turns out that the second argument to the String replaceAll method “may” have some issues with dollar signs and backslashes which you only find out about if you dig into the Matcher class that backs the replaceAll method or if you’re lucky and you read about the whole thing on a site devoted to regular expressions. In short:

myString.replaceAll(“regex”, “replacement”) replaces all regex matches inside the string with the replacement string you specified. No surprises here. All parts of the string that match the regex are replaced. You can use the contents of capturing parentheses in the replacement text via $1, $2, $3, etc. $0 (dollar zero) inserts the entire regex match. $12 is replaced with the 12th backreference if it exists, or with the 1st backreference followed by the literal “2” if there are less than 12 backreferences. If there are 12 or more backreferences, it is not possible to insert the first backreference immediately followed by the literal “2” in the replacement text.

In the replacement text, a dollar sign not followed by a digit causes an IllegalArgumentException to be thrown. If there are less than 9 backreferences, a dollar sign followed by a digit greater than the number of backreferences throws an IndexOutOfBoundsException. So be careful if the replacement string is a user-specified string. To insert a dollar sign as literal text, use \$ in the replacement text. When coding the replacement text as a literal string in your source code, remember that the backslash itself must be escaped too: “\\$”.

CIS 552: Database Design

I took the final exam today in my CIS 552 Database Design course. I think I did ok on the exam but more importantly, I learned a lot this semester. I think the main takeway from the course is that it helps to realize that a database application (meaning an application like MySQL, Oracle or SQL Server, not a database instance like Northwind) is more than just SQL, in fact the bottom level of a database management system stack is “just” reading and writing files to the file system, albeit with transactions and atomicity and multiple users and well, a bunch of other stuff but bear with me. See, when you first start out, you do things like SELECT * FROM table and then you read somewhere how inefficient it is to bring back all the records from your table and so you use SELECT id FROM table. And then pretty soon you hit a wall where you have a query that takes like 5 seconds to run and you learn about indexes. Well, if you’re like me, you just learn that indexes make things go faster… you might read SQL Server books online and see that an index in SQL Server is a B-tree, but you don’t go any further than that and so you just come away knowing that indexes make your application run fastertm. After that, you read find someone somewhere that says that the order of your selection criteria matters and that you should try and give hints to the query optimizer when in fact, the query optimizer can do whatever the hell it wants with your declarative SQL statement and probably will.

And then you take a class like this and you learn how a database management system stores records in pages and you get to write some code that stores and retrieves bytes to and from said pages, all the while paying attention to the page header which stored (in our case) the page offset, the number of records in the page and a record size / pointer data structure, all of which is just part of a big file. After that you write code that looks up records in an index, which turns out to be a file as well. Finally, you use a block nested loop to process a join query and it turns out that again, you’re just reading bytes from a file system.

So now I understand why the guys at viaweb didn’t use a database (although I’d never go there myself) and I can have a somewhat intelligent conversation about the differences between a linear search, a binary search, an index scan, a B-tree index seek, and a hash index lookup. And oh man do I wish I was a math major in college.

Other cool stuff I saw along the way:

· On the Goodness of Binary Search by Tim Bray

· General Purpose Hash Function Algorithms

· Binary logarithm

· Google SparseHash a project that contains several hash-map implementations in use at Google, with different performance characteristics, including an implementation that optimizes for space and one that optimizes for speed.

· The Anatomy of a Large-Scale Hypertextual Web Search Engine: “Google’s data structures are optimized so that a large document collection can be crawled, indexed, and searched with little cost. Although, CPUs and bulk input output rates have improved dramatically over the years, a disk seek still requires about 10 ms to complete. Google is designed to avoid disk seeks whenever possible, and this has had a considerable influence on the design of the data structures.” (emphasis mine)

· Lucene architecture

· Quicksort

SQL Server Group By Datetime

If you’ve ever used SQL Server, you know that there’s no such thing as a Date or Time, only Datetime. According to SQL Server Books Online: “If only a time is specified when setting a datetime or smalldatetime value, the date defaults to January 1, 1900. If only a date is specified, the time defaults to 12:00 A.M. (Midnight).”, which is all is fine and good but it becomes an issue when you want just the date or just the time from a column that stores both. Say for instance that my boss wants to see the number of orders per day over the last couple years to see which days have the most orders (‘cyber monday!‘). Unfortunately, there’s no function called date() or time() which returns just the date portion of the datetime or just the time portion of the datetime:

-- doesn't work...
SELECT date(mydate) as thedate, count(id) as perday
FROM orders
GROUP by thedate
ORDER by perday DESC

Turns out there’s a hack that does though:

SELECT distinct convert(varchar,mydate,111) as thedate, count(id) as perday
FROM orders
group by convert(varchar,mydate,111)
order by perday DESC

Hope that makes your day.

Java, JTDS, PreparedStatement and varchar

I’ve been working on an interesting application at work that needs to be fast, the faster the better in fact. I wrote a couple quick and dirty implementations in my scratchpad in Eclipse and I figured that I could get about fifty operations per second (a database UPDATE is involved for every operation among other things). Anyway, I went to develop a full implementation and a then ran a full test of about 100,000 operations. Instead of taking about 30 minutes (100,000 operations / 50 per second = ~ 30 minutes) the operation took about 7 hours. I was getting about 4 operations per second throughput, which was obviously a huge disappointment. The pseudocode I wrote originally looked something like this:

Connection c = DriverManager.getConnection(cs);
String q = "UPDATE mytable SET x = 1 WHERE id = ?";
PreparedStatement p = c.prepareStatement(q);
for (int i=0; i

and it worked well. I made a single change during development: instead of using the ‘id ‘ column of the database table (a numeric 9 byte primary key and thus is the clustered index for the table) I used a 13 byte varchar column as the identifier which had a nonclustered index, my code looked like this:

Connection c = DriverManager.getConnection(cs);
String q = "UPDATE mytable SET x = 1 WHERE y = ?";
PreparedStatement p = c.prepareStatement(q);
for (int i=0; i

The nonclustered index performed just as well as the clustered index: in my testing an UPDATE statement using the varchar column as the constraint in the query worked just as fast as the primary key / clustered index, which makes sense because index seeks (which I learned about in my database design class this semester) on a 9 byte / 72 bit numeric value (because I used a precision of 19 digits) should be similar to index seeks on a 13 byte / 104 bit varchar column. So then I executed the finished program (not the test) and brought up SQL Profiler (a tool that ships with SQL Server that can debug, troubleshoot, monitor, and measure your application’s SQL statements and stored procedures). It quickly became clear what the problem was. Here’s the SQL created by the prepareStatement() method:

create proc #jtds000001 @P0 varchar(4000) as UPDATE mytable SET x = 1 WHERE y = @P0

and then the executeUpdate() method:

exec #jtds000001 N'005QDUKS1MG8K'

See the problem? The JTDS driver turned the 13 byte varchar column into a 4000 byte varchar column (the maximum number of bytes for a column) and then prefixed the parameter with ‘n’, which is used to identify Unicode data types. This substitution caused the query processor to ignore the index on ‘y’ and do an index scan instead of an index seek.

Here’s where is gets fun. Microsoft SQL Server uses a B-tree index structure (also on wikipedia), which is similar to a B+tree, except that search key values can only appear once in the tree. Objects are stored in SQL Server as a collection of 8KB pages and (because of the class I’ve been taking) I now know that you can compute the approximate number of disk IO’s for an index seek as:

logn/2(k)

where n is the number of keys per node and k is the number of search keys. So with one million search keys and 8KB pages in SQL Server, a index on a 13 byte key would create a tree with about 615 nodes (~8000 / 13 = ~615). Thus the index seek in my system was costing about log615/2(1000000) = 2.4 node accesses (one node access ~= one disk IO) versus an index scan (615 nodes @ 8KB each, figure that on average over time we’ll find the value in 615/2 so ~307 node accesses?) which is significantly longer and obviously the cause of the problem.

Moral of the story: watch out for char / varchar constraint parameters when using JTDS and a PreparedStatement. Also, indexes are A Good ThingTM.

Updated 12/04/2005: Brian Heineman (one of the maintainers of the JTDS project) points out that this is a feature, not a bug. He also points out that you can work around the issue by appending:

sendStringParametersAsUnicode=false;

to your database connection string (I tested it out and it works just as advertised). Since the real issue is that JTDS can’t tell if the String instance I’m sending is Unicode or not and so defaults to a Unicode string, the other workaround would be to use the setBytes() method of the PreparedStatement and the use the byte[] representation of the String. From my example above:

p.setBytes(1, somearray[i].getBytes());

Ruby on Rails in the Java community

Couple of months ago I attended the No Fluff Just Stuff conference up in Framingham, I took a bunch of notes which I intented to post to this blog, but never got around to it. The conference tag line is “The best value in the Java / Open Source conferencing space hands down” and I’d have to agree, although the emphasis on Ruby on Rails was surprising. Turns out that a number of the speakers who make their living consulting and writing books about Java have taken up Ruby on Rails and so maybe 25% of the sessions were about Ruby on Rails (the session by Dave Thomas was maybe one of the best conference sessions I’ve ever been too). I guess all this is to say that it’s not a surprise that the next ACM WebTech group meeting in Waltham is going to be about Ruby on Rails.

Subversion + Ant Release Scripts

I’m not sure where this script fits into ‘The Joel Test’, but in the interest of automating the process of releasing a build in Subversion, I had the new guy (hey Pete!) spend his first couple days writing a script that:

  1. copies the trunk to the named branch (ie: /myproject/trunk –> /branches/1.10)
  2. copies the ‘latest’ tag to a tag called ‘rollback’
  3. copies the newly created branch to /tags/latest

The end result is that when we decide to do a release to the production environment, we can simply run the script (download the plain text) which copies the Subversion trunk (the main line of development for those unfamiliar with Subversion) to a tag we called ‘latest’ and also to a branch which matches the version number of the release. We point our production servers to /$project/tags/latest and our development servers (which we setup to run nightly builds) to /$project/trunk.

And just so we know that everything is running smoothly, he modified the Ant deployment script so that it sends an email upon completion indicating success or failure along with the output from the build. The email part I thought was going to be relatively simple (ie: use the mail task), but I wasn’t so sure about the generated output. Turns out that you can use MailLogger feature to listen for and get the status of build events simply by appending a logger flag to the invocation of ant:

ant -logger org.apache.tools.ant.listener.MailLogger

and then by setting the appropriate properties in your build file.

Database Meaning

Rafe Colburn pointed to a blog posting by David Heinemeier Hansson where he that you should keep your business logic in your business layer rather than in your database which then lead to a blog post by Martin Fowler entitled Database Styles. Martin says he has only one principal point but he I think he made two good points. First (as he mentioned), there are generally two styles of databases: application databases (a database controlled and accessed by a single application) and integration databases (which acts as a data store for multiple applications) and that when you enter a discussion about THE ‘database’, you need to make it clear which type of database you’re talking about. His second point is that SOA (I’m not a fan of the acronym, but I like the implementations) can make and maybe should make integration databases unnecessary. Instead of having multiple applications interop through a common datastore (which is really nothing more than a gigantic global variable isn’t it?), each application maintains it’s own application database and communicates through a service interface like SOAP, REST, JMS, etc.

Rafe went on to make a good point as well (or maybe just a point that I agree with). He said that the one exception he makes to the rule of logic in the business layer rather than the database is the use of constraints. A commenter summed it up nicely by saying that “… constraints are to your data what assertions and tests are to your code.”

Update to embedded Axis application in Tomcat

I got a great email from Tamás in response to my last post who pointed out that the straight copy of deploy.wsdd to server-config.wsdd doesn’t cut it. More importantly, he mentioned that there is a utility that ships with Axis that allows you to generate server-config.wsdd from your deploy.wsdd (or from multiple deploy.wsdd if you have multiple web service end points). From the command line it looks like this:

> java -cp axis.jar;jaxrpc.jar;commons-logging.jar;commons-discovery.jar;saaj.jar;
org.apache.axis.utils.Admin server dir1\deploy.wsdd dir2\deploy.wsdd

But if you’re using the Ant build.xml I provided in the previous example, you’d use this:

<java
  classname="org.apache.axis.utils.Admin"
  fork="true"
  failonerror="true"
  classpathref="compile.classpath"
  dir="${basedir}\WEB-INF\">
   <arg value="server" />
    <arg file="${basedir}\deploy.wsdd" />
</java>

I updated the source code example (embeddedaxis.zip), you can download it here.

NOTE: The source code for the Admin class is available here, where you can see (but the documentation doesn’t mention) that the Admin class accepts multiple WSDD files from the command line.

Java, Collections and Multimap

I was in an interview recently and was asked a question which I thought the interviewer called an ‘atagram’, but I think it was actually an anagram. He asked how you could find the largest word in a dictionary where subtracting one letter results in another word (ie: ‘beat’ minus ‘e’ could be ‘tab’). I didn’t come up with an suitable answer during the interview, but during some unrelated reading this week I came across the question and the answer: multimap. Scroll down to the bottom of this page and start reading when you get to multimaps. The trick is that a multimap allows one key to map to multiple values and by alphabetizing each word in the dictionary and then placing the word in the map keyed by the alphabetized word, you can easily find all the available words which result from word minus letter.

XML characters, smart quotes and Apache XML-RPC

I’ve been eating my own dogfood with the deliciousposter project (as you can see from my daily links). A couple days ago I posted a some links to del.icio.us and expected them to show up automatically the next day… except they didn’t. I traced it down to an errant smart quote that I copied from the Internet Alchemy Talis, Web 2.0 and All That post, which caused the Apache XML-RPC library to throw this error:

java.io.IOException: Invalid character data corresponding to XML entity ’

I worked under the assumption that the smart quote was an invalid XML character for quite awhile, but it looks like it actually is according to the XML 1.1 specification, the following characters are allowed in an XML document:

#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

I then checked the source code for the XmlWriter which has this method for writing character data:

...
if (c < 0x20 || c > 0xff) {
  // Though the XML-RPC spec allows any ASCII
  // characters except '<' and '&', the XML spec
  // does not allow this range of characters,
  // resulting in a parse error from most XML
  // parsers.
  throw new XmlRpcClientException("Invalid character data " +
  "corresponding to XML entity &#" +
  String.valueOf((int) c) + ';', null);
} else ..

which turns out to be a tad aggressive. It also turns out that the above code snippet and the version of the Apache XML-RPC library I was using are out of date. The chardata(String text) has been updated in the latest version of the Apache XMl-RPC library to include a new method called isValidXMLChar(char c) which is much more lenient:

if (c == '\n') return true;
if (c == '\r') return true;
if (c == '\t') return true;
if (c
and not coincidentally, is compliant with the specification.

I'll be updating deliciousposter to use the latest version of the Apache XML-RPC library soon. In the meantime, if you're using the Apache XML-RPC library, you should probably download the latest version to take advantage of the new XML character validation method.