Databases & Storage – When 15TB isn’t enough
Oh, the database layer.
The red-headed stepchild of the stack. The layer that's incredibly important, yet typically gets very little attention. To be fair, there's been a lot of development recently to help cut the time spent on the database layer; ORMs (object-relational mappers) are a great example of the industry's attempt to take the extremely important task of building an efficient and extensible database structure off the development team's plate. However, a system built around a database can go miserably wrong very quickly when developers are greedy for data. At the end of the day, the name of the game is time complexity.
Time complexity
Arguably the most important underlying concept in database optimization and design is time complexity: a fancy way of measuring how an algorithm's cost grows with the size of its input. In computer science it's typically expressed in "Big-O" notation. Keep in mind that consumer-grade computers today are extremely fast; they typically have several cores and can perform billions of operations per second. Servers are often several times more powerful and are capable of handling hundreds if not thousands of users at a time. However, these machines can only sustain large-scale user bases when the algorithms and databases behind them are built to deliver data at the smaller time complexities.
A few examples of how common time complexities behave (a quick Python sketch follows the list):
- O(1) – If you're accessing a record directly by its index value (an array position or a hash key, say), you can go straight to the result, which gives us O(1): constant time, meaning the cost doesn't grow with the size of the data set. This is the best scaling behavior you can get.
- O(log n) – Here the algorithm's cost grows far more slowly than the data set does; the key word is 'rate.' A great example is searching a binary search tree: each 'hop' down the tree cuts the remaining search space roughly in half on the way to the result. With extremely large data sets, O(log n) can see massive performance gains over even O(n) and will perform very similarly to O(1).
- O(n) – Let's say you're trying to find all rows 'where' the 'status' column has the value 'selected.' Without an index, the database engine needs to look through the entire table and check each row against your 'where' criteria. In this use case the time complexity is O(n); we express the cost in terms of the whole set 'n' so that it accounts for tables of any size. At the end of the day, the absolute size of the table matters less than how the algorithm's cost scales against it. Another example of O(n) is a single for-each loop. This is often referred to as linear time.
- O(n log n) – More complex problems often require more than a single linear pass over 'n' to complete; efficient comparison sorts such as merge sort, or the fast Fourier transform, fall into this class.
- O(n²) – Bubble sort, or a nested for-each loop.
- O(n³) – O(n!) – Once you reach these higher big-O classes, runtime explodes as the input grows, and there are very few instances where an algorithm ever needs to be this expensive. A brute-force solution to the traveling salesman problem is a good example: it takes a ludicrous amount of processing power to handle even a small set of nodes, because the work grows with the factorial of their count.
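To make the list a bit more concrete, here's a minimal sketch in plain Python. The lists and dictionaries are hypothetical stand-ins for a table and a keyed index, the data is synthetic, and the absolute timings will vary by machine; what matters is how each number grows (or doesn't) when n goes up tenfold.

```python
import bisect
import time

def timed(label, fn):
    start = time.perf_counter()
    fn()
    print(f"  {label:<32} {time.perf_counter() - start:.4f}s")

for n in (100_000, 1_000_000):
    rows = list(range(n))                       # stand-in for a table of n rows
    by_id = {row: True for row in rows}         # stand-in for a keyed/hash lookup
    probes = range(0, n, max(n // 100_000, 1))  # ~100,000 lookups per test

    print(f"n = {n:,}")
    # O(1): jump straight to the row by key; the cost per lookup doesn't grow with n.
    timed("O(1) keyed lookups", lambda: [by_id[i] for i in probes])
    # O(log n): each step of a binary search halves what's left to check.
    timed("O(log n) binary searches", lambda: [bisect.bisect_left(rows, i) for i in probes])
    # O(n): touch every row, like an unindexed WHERE clause.
    timed("O(n) full scan", lambda: sum(1 for r in rows if r % 7 == 0))

# O(n^2): every row compared against every other row; kept small so it finishes.
pairs = list(range(3_000))
timed("O(n^2) nested loop (3,000 rows)", lambda: sum(1 for a in pairs for b in pairs if a == b))
```

Run it and you'll see the O(1) and O(log n) rows barely move while the O(n) scan grows with the table; the nested loop is pinned at 3,000 rows because at the larger sizes it would take far too long.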
So what does this have to do with databases?
Database optimizations usually strive to take work that would normally run at a higher time complexity and bring it down to a lower one. Indexing a table is the classic example: it can bring indexed queries down from O(n) to O(log n), which can net massive performance gains on large tables. Another is turning looping queries into a join, which can take you from a non-indexed O(n²) down to O(n log n). Shaving off an 'n', or turning an 'n' into a 'log n', means massive CPU gains, less disk IO, and sometimes even less bandwidth, depending on how the system is set up. The sketch below shows both ideas in miniature.
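A rough sketch using Python's built-in sqlite3 module; the users/orders tables, the idx_orders_status index name, and the data are made up for illustration, and the exact EXPLAIN QUERY PLAN wording varies a little between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, status TEXT)")
cur.executemany("INSERT INTO users (id, name) VALUES (?, ?)",
                [(i, f"user{i}") for i in range(2_000)])
cur.executemany("INSERT INTO orders (user_id, status) VALUES (?, ?)",
                [(i % 2_000, "selected" if i % 7 == 0 else "pending")
                 for i in range(20_000)])

# 1. Indexing: without an index, a WHERE filter is a full table scan, O(n).
print(cur.execute("EXPLAIN QUERY PLAN "
                  "SELECT * FROM orders WHERE status = 'selected'").fetchall())
# -> typically "SCAN orders" (a full scan)

cur.execute("CREATE INDEX idx_orders_status ON orders (status)")
print(cur.execute("EXPLAIN QUERY PLAN "
                  "SELECT * FROM orders WHERE status = 'selected'").fetchall())
# -> typically "SEARCH orders USING INDEX idx_orders_status" (a B-tree walk, ~O(log n) per lookup)

# 2. Looping queries: one query per user. With no index on user_id, each inner
#    query scans the whole orders table, so the total work is roughly O(n * m).
per_user = {}
for (user_id,) in cur.execute("SELECT id FROM users").fetchall():
    per_user[user_id] = cur.execute(
        "SELECT COUNT(*) FROM orders WHERE user_id = ?", (user_id,)).fetchone()[0]

# The same answer as a single JOIN + GROUP BY: one statement the engine can plan
# as a whole (sorting/hashing internally, roughly O(n log n)) instead of n scans.
per_user = dict(cur.execute("""
    SELECT users.id, COUNT(orders.id)
    FROM users
    LEFT JOIN orders ON orders.user_id = users.id
    GROUP BY users.id
""").fetchall())

conn.close()
```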
Lots of developers focus on which specific piece of the stack is used ("OMG – Nginx is sooo much better than Apache *sips wine snobbishly*"). However, I feel that in most applications the time complexity built in and around the database matters far more to creating a sustainable and extensible system. It's often very difficult to notice that time complexity is a problem until the data set reaches a critical point where all the CPU is consumed and everyone starts waiting for data.