Let me be blunt: using price per terabyte as the measure of a data warehouse platform is holding back the entire business intelligence industry.
Consider this… The Five Minute Rule (see here and here) clearly describes the economics of HW technology… suggesting exactly when data should be retained in memory versus when it may be moved to a peripheral device. But vendors who add sufficient memory to abide by the Rule find themselves significantly improving the price/performance of their products but weakening their price/TB and therefore weakening their competitive position.
We see this all of the time. Almost every database system could benefit from a little more memory. The more modern systems which use a data flow paradigm, Greenplum for example, try to minimize I/O by using memory effectively. But the incentive is to keep the memory configured low to keep their price/TB down. Others, like Teradata, use memory carefully (see here) and write intermediate results to disk or SSD to keep their price/TB down… but they violate the Five Minute Rule with each spool I/O. Note that this is not a criticism of Teradata… they could use more memory to good effect… but the use of price/TB as the guiding principle dissuades them.
Now comes Amazon Redshift… with the lowest imaginable price/TB… and little mention of price/performance at all. Again, do not misunderstand… I think that Redshift is a good thing. Customers should have options that trade-off performance for price… and there are other things I like about Redshift that I’ll save for another post. But if price/TB is the only measure then performance becomes far too unimportant. When price/TB is the driver performance becomes just a requirement to be met. The result is that today adequate performance is OK if the price/TB is low. Today IT departments are judged harshly for spending too much per terabyte… and judged less harshly or excused if performance becomes barely adequate or worse.
I believe that in the next year or two that every BI/DW eco-system will be confronted with the reality of providing sub-three second response to every query as users move to mobile devices: phones, tablets, watches, etc. IT departments will be faced with two options:
- They can procure more expensive systems with a high price/TB ratio… but with an effective price/performance ratio and change the driving metric… or
- They can continue to buy inexpensive systems based on a low price/TB and then spend staff dollars to build query-specific data structures (aggregates, materialized views, data marts, etc.) to achieve the required performance.
It is time for price/performance to become the driver and support for some number of TBs to be a requirement. This will delight users who will appreciate better, not adequate, performance. It will lower the TCO by reducing the cost of developing and operating query-specific systems and structures. It will provide the agility so missed in the DW space by letting companies use hardware performance to solve problems instead of people. It is time.
When I was at Greenplum… and now again at SAP… I ran into a strange logic from Teradata about query concurrency. They claimed that query concurrency was a good thing and an indicator of excellent workload management. Let’s look at a simple picture of how that works.
In Figure 1 we depict a single query on a Teradata cluster. Since each node is working in parallel the picture is representative no matter how many nodes are attached. In the picture each line represents the time it takes to read a block from disk. To make the picture simple we will show I/O taking only 1/10th of the clock time… in the real world it is slower.
Given this simplification we can see that a single query can only consume 10% of the CPU… and the rest of the time the CPU is idle… waiting for work. We also represented some I/O to spool files… as Teradata writes all intermediate results to disk and then reads them in the next step. But this picture is a little unfair to Greenplum and HANA as I do not represent spool I/O completely. For each qualifying row the data is read from the table on disk, written to spool, and then read from spool in the subsequent step. But this note is about concurrency… so I simplified the picture.
Figure 2 shows the same query running on Greenplum. Note that Greenplum uses a data flow architecture that pushes tuples from step to step in the execution plan without writing them to disk. As a result the query completes very quickly after the last tuple is scanned from the table.
Let me say again… this story is about CPU utilization, concurrency, and workload management… I’m not trying to say that there are not optimizations that might make Teradata outperform Greenplum… or optimizations that might make Greenplum even faster still… I just want you to see the impact on concurrency of the spool architecture versus the data flow architecture.
Note that on Greenplum the processors are 20% busy in the interval that the query runs. For complex queries with lots of steps the data flow architecture provides an even more significant advantage to Greenplum. If there are 20 steps in the execution plan then Teradata will do spool I/O, first writing then reading the intermediate results while Greenplum manages all of the results in-memory after the initial reads.
In Figure 3 we see the impact of having the data in-memory as with HANA or TimeTen. Again, I am ignoring the implications of HANA’s columnar orientation and so forth… but you can clearly see the implications by removing block I/O.
Now let’s look at the same pictures with 2 concurrent queries. Let’s assume no workload management… just first in, first out.
In Figure 4 we see Teradata with two concurrent queries. Teradata has both queries executing at the same time. The second query is using up the wasted space made available while the CPUs wait for Query 1’s I/O to complete. Teradata spools the intermediate results to disk; which reduces the impact on memory while they wait. This is very wasteful as described here and here (in short, the Five Minute Rule suggests that data that will be reused right away is more economically stored in memory)… but Teradata carries a legacy from the days when memory was dear.
But to be sure… Teradata has two queries running concurrently. And the CPU is now 20% busy.
Figure 5 shows the two-query picture for Greenplum. Like Teradata, they use the gaps to do work and get both queries running concurrently. Greenplum uses the CPU much more efficiently and does not write and read to spool in between every step.
In Figure 6 we see HANA with two queries. Since one query consumed all of the CPU the second query waits… then blasts through. There is no concurrency… but the work is completed in a fraction of the time required by Teradata.
If we continue to add queries using these simple models we would get to the point where there is no CPU available on any architecture. At this point workload management comes into play. If there is no CPU then all that can be done is to either manage queries in a queue… letting them wait for resources to start… or start them and let them wastefully thrash in and out… there is really no other architectural option.
So using this very simple depiction eventually all three systems find themselves in the same spot… no CPU to spare. But there is much more to the topic and I’ve hinted about these in previous posts.
Starting more queries than you can service is wasteful. Queries have to swap in and out of memory and/or in and out of spool (more I/O!) and/or in and out of the processor caches. It is best to control concurrency… not embrace it.
Running virtual instances of the database instead of lightweight threads adds significant communications overhead. Instances often become unbalanced as the data returned makes the shards uneven. Since queries end when the slowest instance finishes it’s work this can reduce query performance. Each time you preempt a running query you have to restore state and repopulate the processor’s cache… which slows the query by 12X-20X. … Columnar storage helps… but if the data is decompressed too soon then the help is sub-optimal… and so on… all of the tricks used by databases and described in these blogs count.
But what does not count is query concurrency. When Teradata plays this card against Greenplum or HANA they are not talking architecture… it is silliness. Query throughput is what matters. Anyone would take a system that processes 100,000 queries per hour over a system that processes 50,000 queries per hour but lets them all run concurrently.
I’ve been picking on Teradata lately as they have been marketing hard… a little too hard. Teradata is a fine system and they should be proud of their architecture and their place in the market. I am proud to have worked for them. I’ll lay off for a while.
Following on to my blog on the Five Minute Rule and in-memory databases here I decided to quickly and informally recalculate the 4KB break-even point based on current technology (rather than use the 2007 numbers) The results are as follows:
- A 1TB SATA Disk with 4.2ms average latency and 126MB/s max transfer rate costs $100 here
- A 4GB DDR3 ECC memory card costs $33 here (I picked fairly expensive ECC memory… I could have gone with the $18 average price mentioned here)
- Apply the Gray/Putzolu formula: Break Even Interval = (Pages per MB of RAM/Accesses per Second per Disk) * ($ per Disk Drive/$ per MB of RAM)
And we find that today the break-even point for a 4KB block of data is 55 minutes…
Again… this means that for any 4KB block of data… or for any database table where there are 4KB blocks that are touched… within a 55 minute window it is more cost-effective to keep the data in-memory than to move it back and forth from disk. If the data is compressed the duration increases with the compression so that a table with 2X compression should reside in-memory if accessed on the average every 110 minutes.
I was recently reminded of a couple of papers written by Jim Gray and Gianfranco Putzolu that calculated the cost of keeping data in memory vs the cost of paging it in from disk. I was happy to see that the thread was being kept alive by Goetz Graefe.
These papers used the cost of each media to determine how “hot” data needed to be to be cost-effectively stored in-memory. The 1987 five minute rule (click here to reference the original papers) was so named because at that time and based on the relative costs of CPU, Memory, and Disk; a 1KB record that was accessed every five minutes could be effectively stored in memory and a 4KB block of data broke-even at two minutes.
In 2009, with CPU prices coming down but the number of instructions executed per second going up, and with memory and prices down, the break-even point between keeping 4KB in memory or on a SATA disk was 90 minutes.
Let’s be clear about what this means. Based solely on the cost of CPUs, RAM, and SATA drives; any data that is accessed more frequently than each 90 minutes should be kept in memory. This does not include any ROI based on the business benefits of a speedy response. It does not adjust for data compression which allows more than 4KB of user data to use 4KB of RAM. Just pure IT economics gets us to this point.
So… if you have data in a data warehouse or a mart that is touched by a query at least once every 90 minutes… it is wasteful to store it on disk. If you have an in-memory database than can compress the data 2X and use it in its compressed form, then the duration goes up to 180 minutes. You do not have to look any further than this to find the ROI for an in-memory data base (IMDB).