I took the final exam today in my CIS 552 Database Design course. I think I did ok on the exam but more importantly, I learned a lot this semester. I think the main takeway from the course is that it helps to realize that a database application (meaning an application like MySQL, Oracle or SQL Server, not a database instance like Northwind) is more than just SQL, in fact the bottom level of a database management system stack is “just” reading and writing files to the file system, albeit with transactions and atomicity and multiple users and well, a bunch of other stuff but bear with me. See, when you first start out, you do things like SELECT * FROM table
and then you read somewhere how inefficient it is to bring back all the records from your table and so you use SELECT id FROM table
. And then pretty soon you hit a wall where you have a query that takes like 5 seconds to run and you learn about indexes. Well, if you’re like me, you just learn that indexes make things go faster… you might read SQL Server books online and see that an index in SQL Server is a B-tree, but you don’t go any further than that and so you just come away knowing that indexes make your application run fastertm. After that, you read find someone somewhere that says that the order of your selection criteria matters and that you should try and give hints to the query optimizer when in fact, the query optimizer can do whatever the hell it wants with your declarative SQL statement and probably will.
And then you take a class like this and you learn how a database management system stores records in pages and you get to write some code that stores and retrieves bytes to and from said pages, all the while paying attention to the page header which stored (in our case) the page offset, the number of records in the page and a record size / pointer data structure, all of which is just part of a big file. After that you write code that looks up records in an index, which turns out to be a file as well. Finally, you use a block nested loop to process a join query and it turns out that again, you’re just reading bytes from a file system.
So now I understand why the guys at viaweb didn’t use a database (although I’d never go there myself) and I can have a somewhat intelligent conversation about the differences between a linear search, a binary search, an index scan, a B-tree index seek, and a hash index lookup. And oh man do I wish I was a math major in college.
Other cool stuff I saw along the way:
· On the Goodness of Binary Search by Tim Bray
· General Purpose Hash Function Algorithms
· Google SparseHash a project that contains several hash-map implementations in use at Google, with different performance characteristics, including an implementation that optimizes for space and one that optimizes for speed.
· The Anatomy of a Large-Scale Hypertextual Web Search Engine: “Google’s data structures are optimized so that a large document collection can be crawled, indexed, and searched with little cost. Although, CPUs and bulk input output rates have improved dramatically over the years, a disk seek still requires about 10 ms to complete. Google is designed to avoid disk seeks whenever possible, and this has had a considerable influence on the design of the data structures.” (emphasis mine)