Category Archives: Software Development

java.lang.IllegalArgumentException: Illegal group reference, replaceAll and dollar signs

February 9, 2006 ajohnson 39 Comments

This weblog is officially about inane things I run into while trying to do my job at work. Let’s say you have a String object like this:
String mystring = "Your password: #PASSWORD";
and at runtime you need to replace the value of #PASSWORD with a password that a user typed in. You’d write something like this:
String password = "$Jslwe";
mystring = mystring.replaceAll("#PASSWORD", password);
What would happen? You’d expect that the key #PASSWORD would get replaced with the value of the variable ‘password’ (which is “$Jslwe”) and then you’d move happily on your way to something much more interesting. But no, Java throws you an error:
java.lang.IllegalArgumentException: Illegal group reference
which is extremely helpful. Turns out that the second argument to the String replaceAll method “may” have some issues with dollar signs and backslashes which you only find out about if you dig into the Matcher class that backs the replaceAll method or if you’re lucky and you read about the whole thing on a site devoted to regular expressions. In short:

myString.replaceAll(“regex”, “replacement”) replaces all regex matches inside the string with the replacement string you specified. No surprises here. All parts of the string that match the regex are replaced. You can use the contents of capturing parentheses in the replacement text via $1, $2, $3, etc. $0 (dollar zero) inserts the entire regex match. $12 is replaced with the 12th backreference if it exists, or with the 1st backreference followed by the literal “2” if there are less than 12 backreferences. If there are 12 or more backreferences, it is not possible to insert the first backreference immediately followed by the literal “2” in the replacement text.

In the replacement text, a dollar sign not followed by a digit causes an IllegalArgumentException to be thrown. If there are less than 9 backreferences, a dollar sign followed by a digit greater than the number of backreferences throws an IndexOutOfBoundsException. So be careful if the replacement string is a user-specified string. To insert a dollar sign as literal text, use \$ in the replacement text. When coding the replacement text as a literal string in your source code, remember that the backslash itself must be escaped too: “\\$”.
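One way around the exception, sketched here on the assumption that you're on Java 5 or later: Matcher.quoteReplacement() escapes dollar signs and backslashes so the replacement is inserted as literal text. The class and method names (PasswordTemplate, fill) are made up for illustration.

```java
import java.util.regex.Matcher;

public class PasswordTemplate {

    // Replaces every #PASSWORD token with the given value, treating the
    // value as literal text even if it contains '$' or '\'.
    public static String fill(String template, String password) {
        return template.replaceAll("#PASSWORD", Matcher.quoteReplacement(password));
    }

    public static void main(String[] args) {
        String mystring = "Your password: #PASSWORD";
        System.out.println(fill(mystring, "$Jslwe"));
    }
}
```

If the key isn't really a regular expression, String.replace(CharSequence, CharSequence) is another option, since it treats both arguments literally.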

Personal, Software Development

CIS 552: Database Design

December 14, 2005 ajohnson Leave a comment

I took the final exam today in my CIS 552 Database Design course. I think I did ok on the exam but more importantly, I learned a lot this semester. I think the main takeaway from the course is that it helps to realize that a database application (meaning an application like MySQL, Oracle or SQL Server, not a database instance like Northwind) is more than just SQL; in fact, the bottom level of a database management system stack is “just” reading and writing files to the file system, albeit with transactions and atomicity and multiple users and well, a bunch of other stuff but bear with me. See, when you first start out, you do things like SELECT * FROM table and then you read somewhere how inefficient it is to bring back all the records from your table and so you use SELECT id FROM table. And then pretty soon you hit a wall where you have a query that takes like 5 seconds to run and you learn about indexes. Well, if you’re like me, you just learn that indexes make things go faster… you might read SQL Server books online and see that an index in SQL Server is a B-tree, but you don’t go any further than that and so you just come away knowing that indexes make your application run faster^tm. After that, you find someone somewhere who says that the order of your selection criteria matters and that you should try and give hints to the query optimizer when in fact, the query optimizer can do whatever the hell it wants with your declarative SQL statement and probably will.

And then you take a class like this and you learn how a database management system stores records in pages and you get to write some code that stores and retrieves bytes to and from said pages, all the while paying attention to the page header which stored (in our case) the page offset, the number of records in the page and a record size / pointer data structure, all of which is just part of a big file. After that you write code that looks up records in an index, which turns out to be a file as well. Finally, you use a block nested loop to process a join query and it turns out that again, you’re just reading bytes from a file system.
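To make the page idea concrete, here is a minimal sketch of a page with a header and a record size / pointer directory. The layout is an assumption for illustration (an 8-byte header holding the page offset and record count, followed by offset/length slots, with record bytes written backwards from the end of the page); it is not the format we used in class, and the SlottedPage name is made up.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class SlottedPage {
    static final int PAGE_SIZE = 8192;
    static final int HEADER_SIZE = 8; // int page offset + int record count
    static final int SLOT_SIZE = 8;   // int record offset + int record length

    private final ByteBuffer buf = ByteBuffer.allocate(PAGE_SIZE);
    // Simplification: a real page would keep this in the header as well.
    private int freeEnd = PAGE_SIZE;

    public SlottedPage(int pageOffset) {
        buf.putInt(0, pageOffset); // where this page lives in the file
        buf.putInt(4, 0);          // no records yet
    }

    public int recordCount() {
        return buf.getInt(4);
    }

    // Stores the record and returns its slot number, or -1 if it does not fit.
    public int insert(byte[] record) {
        int count = recordCount();
        int slotPos = HEADER_SIZE + count * SLOT_SIZE;
        int newFreeEnd = freeEnd - record.length;
        if (newFreeEnd < slotPos + SLOT_SIZE) {
            return -1; // record bytes would collide with the slot directory
        }
        for (int i = 0; i < record.length; i++) {
            buf.put(newFreeEnd + i, record[i]);
        }
        buf.putInt(slotPos, newFreeEnd);        // pointer to the record
        buf.putInt(slotPos + 4, record.length); // size of the record
        buf.putInt(4, count + 1);
        freeEnd = newFreeEnd;
        return count;
    }

    // Reads the record stored in the given slot.
    public byte[] read(int slot) {
        if (slot < 0 || slot >= recordCount()) {
            throw new IndexOutOfBoundsException("no such slot: " + slot);
        }
        int slotPos = HEADER_SIZE + slot * SLOT_SIZE;
        int offset = buf.getInt(slotPos);
        byte[] out = new byte[buf.getInt(slotPos + 4)];
        for (int i = 0; i < out.length; i++) {
            out[i] = buf.get(offset + i);
        }
        return out;
    }

    public static void main(String[] args) {
        SlottedPage page = new SlottedPage(0);
        int slot = page.insert("hello".getBytes(StandardCharsets.UTF_8));
        System.out.println(new String(page.read(slot), StandardCharsets.UTF_8));
    }
}
```

Writing the buffer out to a file at the page offset is all it would take to turn this into the "just reading and writing bytes" part of the story.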

So now I understand why the guys at viaweb didn’t use a database (although I’d never go there myself) and I can have a somewhat intelligent conversation about the differences between a linear search, a binary search, an index scan, a B-tree index seek, and a hash index lookup. And oh man do I wish I was a math major in college.

Other cool stuff I saw along the way:

· On the Goodness of Binary Search by Tim Bray

· General Purpose Hash Function Algorithms

· Binary logarithm

· Google SparseHash: a project that contains several hash-map implementations in use at Google, with different performance characteristics, including an implementation that optimizes for space and one that optimizes for speed.

· The Anatomy of a Large-Scale Hypertextual Web Search Engine: “Google’s data structures are optimized so that a large document collection can be crawled, indexed, and searched with little cost. Although, CPUs and bulk input output rates have improved dramatically over the years, a disk seek still requires about 10 ms to complete. Google is designed to avoid disk seeks whenever possible, and this has had a considerable influence on the design of the data structures.” (emphasis mine)

· Lucene architecture

· Quicksort

Software Development

SQL Server Group By Datetime

December 6, 2005 ajohnson 34 Comments

If you’ve ever used SQL Server, you know that there’s no such thing as a Date or Time, only Datetime. According to SQL Server Books Online: “If only a time is specified when setting a datetime or smalldatetime value, the date defaults to January 1, 1900. If only a date is specified, the time defaults to 12:00 A.M. (Midnight).”, which is all fine and good, but it becomes an issue when you want just the date or just the time from a column that stores both. Say for instance that my boss wants to see the number of orders per day over the last couple years to see which days have the most orders (‘cyber monday!‘). Unfortunately, there’s no function called date() or time() which returns just the date portion of the datetime or just the time portion of the datetime:
-- doesn't work...
SELECT date(mydate) AS thedate, count(id) AS perday
FROM orders
GROUP BY thedate
ORDER BY perday DESC
Turns out there’s a hack that does though:
SELECT DISTINCT convert(varchar,mydate,111) AS thedate, count(id) AS perday
FROM orders
GROUP BY convert(varchar,mydate,111)
ORDER BY perday DESC
Hope that makes your day.

J2EE, Software Development

Java, JTDS, PreparedStatement and varchar

December 2, 2005 ajohnson 6 Comments

I’ve been working on an interesting application at work that needs to be fast, the faster the better in fact. I wrote a couple quick and dirty implementations in my scratchpad in Eclipse and I figured that I could get about fifty operations per second (a database UPDATE is involved for every operation among other things). Anyway, I went to develop a full implementation and then ran a full test of about 100,000 operations. Instead of taking about 30 minutes (100,000 operations / 50 per second = ~ 30 minutes) the operation took about 7 hours. I was getting about 4 operations per second throughput, which was obviously a huge disappointment. The pseudocode I wrote originally looked something like this:
Connection c = DriverManager.getConnection(cs);
String q = "UPDATE mytable SET x = 1 WHERE id = ?";
PreparedStatement p = c.prepareStatement(q);
for (int i=0; i
and it worked well. I made a single change during development: instead of using the ‘id’ column of the database table (a 9 byte numeric primary key, and thus the clustered index for the table), I used a 13 byte varchar column with a nonclustered index as the identifier. My code looked like this:
Connection c = DriverManager.getConnection(cs);
String q = "UPDATE mytable SET x = 1 WHERE y = ?";
PreparedStatement p = c.prepareStatement(q);
for (int i=0; i
The nonclustered index performed just as well as the clustered index: in my testing an UPDATE statement using the varchar column as the constraint in the query worked just as fast as the primary key / clustered index, which makes sense because index seeks (which I learned about in my database design class this semester) on a 9 byte / 72 bit numeric value (because I used a precision of 19 digits) should be similar to index seeks on a 13 byte / 104 bit varchar column. So then I executed the finished program (not the test) and brought up SQL Profiler (a tool that ships with SQL Server that can debug, troubleshoot, monitor, and measure your application’s SQL statements and stored procedures). It quickly became clear what the problem was. Here’s the SQL created by the prepareStatement() method:
create proc #jtds000001 @P0 varchar(4000) as UPDATE mytable SET x = 1 WHERE y = @P0
and then the executeUpdate() method:
exec #jtds000001 N'005QDUKS1MG8K'
See the problem? The JTDS driver turned the 13 byte varchar column into a 4000 byte varchar column (the maximum number of bytes for a column) and then prefixed the parameter with ‘N’, which is used to identify Unicode data types. Because the Unicode type takes precedence in the comparison, SQL Server has to convert the values in column ‘y’ rather than the parameter, so the query processor could not seek on the index on ‘y’ and did an index scan instead.

Here’s where it gets fun. Microsoft SQL Server uses a B-tree index structure (also on wikipedia), which is similar to a B+tree, except that search key values can only appear once in the tree. Objects are stored in SQL Server as a collection of 8KB pages and (because of the class I’ve been taking) I now know that you can compute the approximate number of disk IO’s for an index seek as:
log_{n/2}(k)
where n is the number of keys per node and k is the number of search keys. So with one million search keys and 8KB pages in SQL Server, an index on a 13 byte key would hold about 615 keys per node (~8000 / 13 = ~615). Thus the index seek in my system was costing about log_{615/2}(1000000) = 2.4 node accesses (one node access ~= one disk IO) versus an index scan (1,000,000 keys / 615 keys per node = ~1,626 leaf nodes @ 8KB each; figure that on average over time we’ll find the value halfway through, so ~813 node accesses?), which takes significantly longer and was obviously the cause of the problem.
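The seek estimate is easy to check in code. This is just the formula above with the numbers from this post plugged in; the class and method names (IndexCost, seekCost) are made up, and the result is a back-of-the-envelope figure, not a measurement.

```java
public class IndexCost {

    // Approximate node accesses for a B-tree seek: log base (n/2) of k,
    // where n = keys per node and k = number of search keys.
    public static double seekCost(double keysPerNode, double searchKeys) {
        return Math.log(searchKeys) / Math.log(keysPerNode / 2.0);
    }

    public static void main(String[] args) {
        double keysPerNode = 8000.0 / 13.0; // ~615 13-byte keys in an 8KB page
        double cost = seekCost(keysPerNode, 1_000_000);
        System.out.println("keys per node: " + Math.round(keysPerNode));
        System.out.println("seek cost in node accesses: " + cost);
    }
}
```

The seek cost comes out to roughly 2.4 node accesses, and because it grows with the logarithm of the number of keys, even a billion keys would only need about one more node access.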

Moral of the story: watch out for char / varchar constraint parameters when using JTDS and a PreparedStatement. Also, indexes are A Good Thing^TM.

Updated 12/04/2005: Brian Heineman (one of the maintainers of the JTDS project) points out that this is a feature, not a bug. He also points out that you can work around the issue by appending:
sendStringParametersAsUnicode=false;
to your database connection string (I tested it out and it works just as advertised). Since the real issue is that JTDS can’t tell if the String instance I’m sending is Unicode or not and so defaults to a Unicode string, the other workaround would be to use the setBytes() method of the PreparedStatement and then use the byte[] representation of the String. From my example above:
p.setBytes(1, somearray[i].getBytes());
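Putting the connection string workaround together, a rough sketch might look like the following. The host, port and database name in the URL are placeholders, the class and method names (VarcharUpdate, withNonUnicodeStrings, updateAll) are made up, and actually connecting requires the JTDS driver on the classpath and a reachable SQL Server.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class VarcharUpdate {

    // Appends the JTDS property that sends String parameters as varchar
    // instead of Unicode.
    public static String withNonUnicodeStrings(String jdbcUrl) {
        String separator = jdbcUrl.endsWith(";") ? "" : ";";
        return jdbcUrl + separator + "sendStringParametersAsUnicode=false";
    }

    // Runs the UPDATE from this post once per key, reusing one statement.
    public static int updateAll(Connection c, String[] keys) throws SQLException {
        String q = "UPDATE mytable SET x = 1 WHERE y = ?";
        int updated = 0;
        try (PreparedStatement p = c.prepareStatement(q)) {
            for (String key : keys) {
                p.setString(1, key);
                updated += p.executeUpdate();
            }
        }
        return updated;
    }

    public static void main(String[] args) throws SQLException {
        if (args.length == 0) {
            // No URL given: just show what the connection string would look like.
            System.out.println(withNonUnicodeStrings("jdbc:jtds:sqlserver://localhost:1433/mydb"));
            return;
        }
        try (Connection c = DriverManager.getConnection(withNonUnicodeStrings(args[0]))) {
            System.out.println(updateAll(c, new String[] {"005QDUKS1MG8K"}) + " rows updated");
        }
    }
}
```

Setting the property once in the connection string keeps the loop on plain setString() calls, whereas the setBytes() approach has to be repeated at every call site.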

J2EE, Ruby on Rails, Software Development

Ruby on Rails in the Java community

November 12, 2005 ajohnson 2 Comments

A couple of months ago I attended the No Fluff Just Stuff conference up in Framingham. I took a bunch of notes which I intended to post to this blog, but never got around to it. The conference tag line is “The best value in the Java / Open Source conferencing space hands down” and I’d have to agree, although the emphasis on Ruby on Rails was surprising. Turns out that a number of the speakers who make their living consulting and writing books about Java have taken up Ruby on Rails and so maybe 25% of the sessions were about Ruby on Rails (the session by Dave Thomas was maybe one of the best conference sessions I’ve ever been to). I guess all this is to say that it’s not a surprise that the next ACM WebTech group meeting in Waltham is going to be about Ruby on Rails.

Software Development, Systems Administration

Subversion + Ant Release Scripts

October 11, 2005 ajohnson 1 Comment

I’m not sure where this script fits into ‘The Joel Test’, but in the interest of automating the process of releasing a build in Subversion, I had the new guy (hey Pete!) spend his first couple days writing a script that:

copies the trunk to the named branch (ie: /myproject/trunk –> /branches/1.10)
copies the ‘latest’ tag to a tag called ‘rollback’
copies the newly created branch to /tags/latest

The end result is that when we decide to do a release to the production environment, we can simply run the script (download the plain text) which copies the Subversion trunk (the main line of development for those unfamiliar with Subversion) to a tag we called ‘latest’ and also to a branch which matches the version number of the release. We point our production servers to /$project/tags/latest and our development servers (which we set up to run nightly builds) to /$project/trunk.

And just so we know that everything is running smoothly, he modified the Ant deployment script so that it sends an email upon completion indicating success or failure along with the output from the build. The email part I thought was going to be relatively simple (ie: use the mail task), but I wasn’t so sure about the generated output. Turns out that you can use the MailLogger feature to listen for and get the status of build events simply by appending a logger flag to the invocation of ant:
ant -logger org.apache.tools.ant.listener.MailLogger
and then by setting the appropriate properties in your build file.

Software Development

Database Meaning

October 7, 2005 ajohnson Leave a comment

Rafe Colburn pointed to a blog posting by David Heinemeier Hansson where he argues that you should keep your business logic in your business layer rather than in your database, which then led to a blog post by Martin Fowler entitled Database Styles. Martin says he has only one principal point but I think he made two good points. First (as he mentioned), there are generally two styles of databases: application databases (a database controlled and accessed by a single application) and integration databases (which act as a data store for multiple applications) and that when you enter a discussion about THE ‘database’, you need to make it clear which type of database you’re talking about. His second point is that SOA (I’m not a fan of the acronym, but I like the implementations) can make and maybe should make integration databases unnecessary. Instead of having multiple applications interop through a common datastore (which is really nothing more than a gigantic global variable isn’t it?), each application maintains its own application database and communicates through a service interface like SOAP, REST, JMS, etc.

Rafe went on to make a good point as well (or maybe just a point that I agree with). He said that the one exception he makes to the rule of logic in the business layer rather than the database is the use of constraints. A commenter summed it up nicely by saying that “… constraints are to your data what assertions and tests are to your code.”

J2EE, Open Source, SOAP, Software Development

Update to embedded Axis application in Tomcat

September 13, 2005 ajohnson 1 Comment

I got a great email from Tamás in response to my last post who pointed out that the straight copy of deploy.wsdd to server-config.wsdd doesn’t cut it. More importantly, he mentioned that there is a utility that ships with Axis that allows you to generate server-config.wsdd from your deploy.wsdd (or from multiple deploy.wsdd if you have multiple web service end points). From the command line it looks like this:
> java -cp axis.jar;jaxrpc.jar;commons-logging.jar;commons-discovery.jar;saaj.jar; org.apache.axis.utils.Admin server dir1\deploy.wsdd dir2\deploy.wsdd
But if you’re using the Ant build.xml I provided in the previous example, you’d use this:
<java classname="org.apache.axis.utils.Admin" fork="true" failonerror="true" classpathref="compile.classpath" dir="${basedir}\WEB-INF\">
  <arg value="server" />
  <arg file="${basedir}\deploy.wsdd" />
</java>
I updated the source code example (embeddedaxis.zip), you can download it here.

NOTE: The source code for the Admin class is available here, where you can see (but the documentation doesn’t mention) that the Admin class accepts multiple WSDD files from the command line.

J2EE, Software Development

Java, Collections and Multimap

August 30, 2005 ajohnson 1 Comment

I was in an interview recently and was asked a question which I thought the interviewer called an ‘atagram’, but I think it was actually an anagram. He asked how you could find the largest word in a dictionary where subtracting one letter results in another word (ie: ‘beat’ minus ‘e’ could be ‘tab’). I didn’t come up with a suitable answer during the interview, but during some unrelated reading this week I came across the question and the answer: multimap. Scroll down to the bottom of this page and start reading when you get to multimaps. The trick is that a multimap allows one key to map to multiple values and by alphabetizing each word in the dictionary and then placing the word in the map keyed by the alphabetized word, you can easily find all the available words which result from word minus letter.
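Here’s a rough sketch of how that might look in plain Java, using a HashMap of lists as a stand-in for a multimap (assumes Java 8 or later). The class and method names are made up, and this is one way to read the puzzle, not necessarily the answer the interviewer had in mind.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class AnagramMultimap {

    // Sorts the letters of a word, so "beat" and "beta" share the key "abet".
    public static String key(String word) {
        char[] letters = word.toCharArray();
        Arrays.sort(letters);
        return new String(letters);
    }

    // Builds the multimap: alphabetized key -> every word with those letters.
    public static Map<String, List<String>> index(List<String> dictionary) {
        Map<String, List<String>> multimap = new HashMap<>();
        for (String word : dictionary) {
            multimap.computeIfAbsent(key(word), k -> new ArrayList<>()).add(word);
        }
        return multimap;
    }

    // Returns the longest word that, minus one letter, can be rearranged
    // into another dictionary word; null if there is no such word.
    public static String longestReducible(List<String> dictionary) {
        Map<String, List<String>> multimap = index(dictionary);
        String best = null;
        for (String word : dictionary) {
            if (best != null && word.length() <= best.length()) {
                continue; // cannot beat the current best
            }
            String k = key(word);
            for (int i = 0; i < k.length(); i++) {
                // Drop the i-th letter of the sorted key and look it up.
                String shorter = k.substring(0, i) + k.substring(i + 1);
                if (multimap.containsKey(shorter)) {
                    best = word;
                    break;
                }
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("tab", "beat", "at", "zebra");
        System.out.println(longestReducible(words));
    }
}
```

Because the keys are already alphabetized, removing one letter from a key always produces another valid key, so each candidate costs one map lookup per letter.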

J2EE, Open Source, Software Development, XML

XML characters, smart quotes and Apache XML-RPC

August 12, 2005 ajohnson 1 Comment

I’ve been eating my own dogfood with the deliciousposter project (as you can see from my daily links). A couple days ago I posted some links to del.icio.us and expected them to show up automatically the next day… except they didn’t. I traced it down to an errant smart quote that I copied from the Internet Alchemy Talis, Web 2.0 and All That post, which caused the Apache XML-RPC library to throw this error:
java.io.IOException: Invalid character data corresponding to XML entity &#8217;
I worked for quite a while under the assumption that the smart quote was an invalid XML character, but it looks like it is actually valid. According to the XML 1.0 specification, the following characters are allowed in an XML document:
#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
I then checked the source code for the XmlWriter which has this method for writing character data:
...
if (c < 0x20 || c > 0xff) {
    // Though the XML-RPC spec allows any ASCII
    // characters except '<' and '&', the XML spec
    // does not allow this range of characters,
    // resulting in a parse error from most XML
    // parsers.
    throw new XmlRpcClientException("Invalid character data " +
        "corresponding to XML entity &#" +
        String.valueOf((int) c) + ';', null);
} else ..
which turns out to be a tad aggressive. It also turns out that the above code snippet and the version of the Apache XML-RPC library I was using are out of date. The chardata(String text) method has been updated in the latest version of the Apache XML-RPC library to use a new method called isValidXMLChar(char c), which is much more lenient:
if (c == '\n') return true;
if (c == '\r') return true;
if (c == '\t') return true;
if (c
and, not coincidentally, is compliant with the specification.
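For reference, here is a small standalone sketch of a check that follows the character ranges quoted above. It is my own illustration, not the Apache XML-RPC source; the class and method names are made up, and it works on code points rather than chars so that the supplementary range can be checked too.

```java
public class XmlChars {

    // True if the code point falls in the allowed ranges:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    public static boolean isValidXmlChar(int codePoint) {
        return codePoint == 0x9 || codePoint == 0xA || codePoint == 0xD
            || (codePoint >= 0x20 && codePoint <= 0xD7FF)
            || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
            || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
    }

    // Returns the index of the first invalid character, or -1 if the text is clean.
    public static int firstInvalidIndex(String text) {
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (!isValidXmlChar(cp)) {
                return i;
            }
            i += Character.charCount(cp);
        }
        return -1;
    }

    public static void main(String[] args) {
        // A right single smart quote (U+2019) is inside [#x20-#xD7FF], so it passes.
        System.out.println(firstInvalidIndex("Web 2.0 and all that\u2019s new"));
        // A NUL character is not allowed anywhere in an XML document.
        System.out.println(firstInvalidIndex("bad\u0000byte"));
    }
}
```

The smart quote that started all this is U+2019, which sits comfortably inside the [#x20-#xD7FF] range.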


I'll be updating deliciousposter to use the latest version of the Apache XML-RPC library soon.  In the meantime, if you're using the Apache XML-RPC library, you should probably download the latest version to take advantage of the new XML character validation method.


		
	
	
		
		What’s Going On Here?
			My name is Aaron Johnson and I created this blog both for me (mostly) and sometimes you. I've been saving my delicious pinboard.in links here and blogging since 2002. During the week (and at night and some weekends and well.. most of the time), I work in engineering product management look after engineering at a software company in Portland, Oregon. When I'm not working, I'm hanging out with my amazing wife, our dinosaur Star Wars loving son three boys,   and five chickens, and giant dog in the burbs outside of Portland, Oregon.
		