CIS 552: Database Design

I took the final exam today in my CIS 552 Database Design course. I think I did ok on the exam but more importantly, I learned a lot this semester. I think the main takeway from the course is that it helps to realize that a database application (meaning an application like MySQL, Oracle or SQL Server, not a database instance like Northwind) is more than just SQL, in fact the bottom level of a database management system stack is “just” reading and writing files to the file system, albeit with transactions and atomicity and multiple users and well, a bunch of other stuff but bear with me. See, when you first start out, you do things like SELECT * FROM table and then you read somewhere how inefficient it is to bring back all the records from your table and so you use SELECT id FROM table. And then pretty soon you hit a wall where you have a query that takes like 5 seconds to run and you learn about indexes. Well, if you’re like me, you just learn that indexes make things go faster… you might read SQL Server books online and see that an index in SQL Server is a B-tree, but you don’t go any further than that and so you just come away knowing that indexes make your application run fastertm. After that, you read find someone somewhere that says that the order of your selection criteria matters and that you should try and give hints to the query optimizer when in fact, the query optimizer can do whatever the hell it wants with your declarative SQL statement and probably will.

And then you take a class like this and you learn how a database management system stores records in pages and you get to write some code that stores and retrieves bytes to and from said pages, all the while paying attention to the page header which stored (in our case) the page offset, the number of records in the page and a record size / pointer data structure, all of which is just part of a big file. After that you write code that looks up records in an index, which turns out to be a file as well. Finally, you use a block nested loop to process a join query and it turns out that again, you’re just reading bytes from a file system.

So now I understand why the guys at viaweb didn’t use a database (although I’d never go there myself) and I can have a somewhat intelligent conversation about the differences between a linear search, a binary search, an index scan, a B-tree index seek, and a hash index lookup. And oh man do I wish I was a math major in college.

Other cool stuff I saw along the way:

· On the Goodness of Binary Search by Tim Bray

· General Purpose Hash Function Algorithms

· Binary logarithm

· Google SparseHash a project that contains several hash-map implementations in use at Google, with different performance characteristics, including an implementation that optimizes for space and one that optimizes for speed.

· The Anatomy of a Large-Scale Hypertextual Web Search Engine: “Google’s data structures are optimized so that a large document collection can be crawled, indexed, and searched with little cost. Although, CPUs and bulk input output rates have improved dramatically over the years, a disk seek still requires about 10 ms to complete. Google is designed to avoid disk seeks whenever possible, and this has had a considerable influence on the design of the data structures.” (emphasis mine)

· Lucene architecture

· Quicksort

SQL Server Group By Datetime

If you’ve ever used SQL Server, you know that there’s no such thing as a Date or Time, only Datetime. According to SQL Server Books Online: “If only a time is specified when setting a datetime or smalldatetime value, the date defaults to January 1, 1900. If only a date is specified, the time defaults to 12:00 A.M. (Midnight).”, which is all is fine and good but it becomes an issue when you want just the date or just the time from a column that stores both. Say for instance that my boss wants to see the number of orders per day over the last couple years to see which days have the most orders (‘cyber monday!‘). Unfortunately, there’s no function called date() or time() which returns just the date portion of the datetime or just the time portion of the datetime:

-- doesn't work...
SELECT date(mydate) as thedate, count(id) as perday
FROM orders
GROUP by thedate
ORDER by perday DESC

Turns out there’s a hack that does though:

SELECT distinct convert(varchar,mydate,111) as thedate, count(id) as perday
FROM orders
group by convert(varchar,mydate,111)
order by perday DESC

Hope that makes your day.

Java, JTDS, PreparedStatement and varchar

I’ve been working on an interesting application at work that needs to be fast, the faster the better in fact. I wrote a couple quick and dirty implementations in my scratchpad in Eclipse and I figured that I could get about fifty operations per second (a database UPDATE is involved for every operation among other things). Anyway, I went to develop a full implementation and a then ran a full test of about 100,000 operations. Instead of taking about 30 minutes (100,000 operations / 50 per second = ~ 30 minutes) the operation took about 7 hours. I was getting about 4 operations per second throughput, which was obviously a huge disappointment. The pseudocode I wrote originally looked something like this:

Connection c = DriverManager.getConnection(cs);
String q = "UPDATE mytable SET x = 1 WHERE id = ?";
PreparedStatement p = c.prepareStatement(q);
for (int i=0; i

and it worked well. I made a single change during development: instead of using the ‘id ‘ column of the database table (a numeric 9 byte primary key and thus is the clustered index for the table) I used a 13 byte varchar column as the identifier which had a nonclustered index, my code looked like this:

Connection c = DriverManager.getConnection(cs);
String q = "UPDATE mytable SET x = 1 WHERE y = ?";
PreparedStatement p = c.prepareStatement(q);
for (int i=0; i

The nonclustered index performed just as well as the clustered index: in my testing an UPDATE statement using the varchar column as the constraint in the query worked just as fast as the primary key / clustered index, which makes sense because index seeks (which I learned about in my database design class this semester) on a 9 byte / 72 bit numeric value (because I used a precision of 19 digits) should be similar to index seeks on a 13 byte / 104 bit varchar column. So then I executed the finished program (not the test) and brought up SQL Profiler (a tool that ships with SQL Server that can debug, troubleshoot, monitor, and measure your application’s SQL statements and stored procedures). It quickly became clear what the problem was. Here’s the SQL created by the prepareStatement() method:

create proc #jtds000001 @P0 varchar(4000) as UPDATE mytable SET x = 1 WHERE y = @P0

and then the executeUpdate() method:

exec #jtds000001 N'005QDUKS1MG8K'

See the problem? The JTDS driver turned the 13 byte varchar column into a 4000 byte varchar column (the maximum number of bytes for a column) and then prefixed the parameter with ‘n’, which is used to identify Unicode data types. This substitution caused the query processor to ignore the index on ‘y’ and do an index scan instead of an index seek.

Here’s where is gets fun. Microsoft SQL Server uses a B-tree index structure (also on wikipedia), which is similar to a B+tree, except that search key values can only appear once in the tree. Objects are stored in SQL Server as a collection of 8KB pages and (because of the class I’ve been taking) I now know that you can compute the approximate number of disk IO’s for an index seek as:

logn/2(k)

where n is the number of keys per node and k is the number of search keys. So with one million search keys and 8KB pages in SQL Server, a index on a 13 byte key would create a tree with about 615 nodes (~8000 / 13 = ~615). Thus the index seek in my system was costing about log615/2(1000000) = 2.4 node accesses (one node access ~= one disk IO) versus an index scan (615 nodes @ 8KB each, figure that on average over time we’ll find the value in 615/2 so ~307 node accesses?) which is significantly longer and obviously the cause of the problem.

Moral of the story: watch out for char / varchar constraint parameters when using JTDS and a PreparedStatement. Also, indexes are A Good ThingTM.

Updated 12/04/2005: Brian Heineman (one of the maintainers of the JTDS project) points out that this is a feature, not a bug. He also points out that you can work around the issue by appending:

sendStringParametersAsUnicode=false;

to your database connection string (I tested it out and it works just as advertised). Since the real issue is that JTDS can’t tell if the String instance I’m sending is Unicode or not and so defaults to a Unicode string, the other workaround would be to use the setBytes() method of the PreparedStatement and the use the byte[] representation of the String. From my example above:

p.setBytes(1, somearray[i].getBytes());

Links: 12-1-2005

  • SwitchTower: Automating Application Deployment
    SwitchTower is a standalone utility that can also integrate nicely with Rails. You simply provide SwitchTower with a deployment recipe that describes your various servers and their roles, and voila! You magically have single-command deployment. It …
    (categories: automation deployment rails ruby )

  • RDT – Ruby Development Tools: Welcome
    RDT is an open source Ruby IDE for the Eclipse platform. Features supported are syntax highlighting, on the fly syntax check, graphical outline, Test::Unit view/runner, Ruby application launching, content assist, source formatter, and a Ruby debugger.
    (categories: eclipse ide ruby )