The shared-nothing architecture has, from the beginning, offered the promise of using hardware to solve performance problems rather than applying staff and tuning. By this I mean… if you can add nodes and scale out to improve query response then why not throw hardware at performance problems rather than build a fragile infrastructure of aggregate tables, cubes, pre-joined/de-normalized marts, materialized views, indexes, etc. Each of these performance workarounds are both expensive to build and expensive to operate.
There are several reasons, I think tuning has been more popular than scaling. Not in any particular order:
First, hardware vendors made it too hard to order/provision new nodes. You could not just press a button and buy capacity. Vendors wanted to charge you for terabytes when all you wanted might be CPU and Memory to fix the problem (see here, sigh). You had to negotiate a deal with a rep, work through your procurement group, wait weeks for delivery. Then, the hardware you have might not match the hardware for sale. New models could not be mixed with old nodes… so you had to consider a whole new cluster. The process was so not-agile. There have been attempts to fix this… and some of them are credible… but none are popular.
Next, the process to install the new nodes was moderately difficult… not rocket science but not seamless to be sure. Data had to move. Backups had to be reconfigured and sometimes old backups could not be easily restored to the new configuration. There was no easy way to burn in the new hardware and if it failed early there were issues reversing the process. It just was not considered an everyday operational process… it was the exception and that made it tough. This process too has improved over time but it never became a no-brainer.
Finally, buying hardware is a capital expense (CAPEX). Even if you had to pay more in people costs to do the hard work of tuning those were operational expenses… and funding was easier to get.
Redshift changes the game here. Even if the Paraccel database is just OK (see here)… and if the overhead of running in the virtualized AWS environment makes it worse… it is still OK. You can provision new hardware in a couple of minutes. If Teradata is 25% faster than Paraccel for your query set… so what? You can add 25% more Redshift for a fraction of the extra cost of Teradata. Need more performance? Dial it in. Need permission? No problem because it is all OPEX dollars.
Redshift will deliver the flexibility to make scale out less expensive than tune it out. The TCO reductions from running a simple system where hardware solves performance problems instead of ETL and staff will be significant. This is how it always should have been.
The issue for Redshift will be… given the trend to reduce the data latency from operations to BI… can you move significant amounts of data from on-premise into the cloud fast enough to meet service level agreements?
Do not overlook Redshift… Amazon could be a player in the EDW space… But look for other databases to make inroads here as well. In-memory databases could work well in the cloud as they avoid some of the hardware abstraction required to access disks.
- Amazon preparing ‘disruptive’ big data AWS service? (go.theregister.com)
- Amazon RedShift Open for Everyone (ozeanmedia.com)
Vertica is the product I saw the most. In fact before they were acquired they were beginning to pop up a lot. The product is innovative in several dimensions. It is wholly column-oriented with several advanced columnar optimizations. Vertica has an advanced data loading strategy that quickly commits data and then creates the column orientation in the background. This greatly reduces the load time but slows queries while the tuples are transformed from rows to columns. Vertica offers a physical construct called a projection that may be built to greatly speed up query performance. Further, they provide a very sophisticated design workbench that will automatically generate projections.
Paraccel is the company I saw the least. I never saw it win… but I know that it does, in fact, win here and there. My impression, and I use this fuzzy word intentionally as I just don’t know, is that Paraccel is a solid product but it does not possess any fundamental architectural advantages that would allow it to win big. Every product will find an acorn now and again. The question is whether in a POC with the full array of competitors there is a large enough sweet spot to be commercially successful? I think that Paraccel cannot win consistently against the full array. Paraccel is now the basis for the Amazon Redshift data warehouse as a service offering. This will keep the product in the game for a while even if the revenues from a subscription model do not help the business much.
Where They Win
Vertica wins when projections are used for most queries. They are not likely to win without projections. This makes them a very effective platform for a single application data mart with a few queries that require fast performance… or for a data mart where the users tend to submit queries that fit into a small number of projection “grooves”.
Paraccel wins in the cloud based on Redshift. They can win based on price. They win now and again when the problem hits their sweet spot. They win when they compete against a small number of vendors (which increases their sweet spot).
Where They Lose
Vertica loses in data warehouse applications where queries cut across the data in so many ways that you cannot build enough projections. Remember that projections are physical constructs with redundant data.
Paraccel loses on application-specific marts and other data marts when the problems fit either Vertica’s projections or Netezza’s zone maps. They lose data warehouse deals to both Teradata and Greenplum when the query set is very broad. They will lose in Redshift when performance is the key… maybe. I have always thought that shared-nothing vendors made it too hard and too expensive to scale out. It should always have been easier to add hardware to improve performance than to apply people to tuning… but this has not been the case… maybe now it is (see here)?
In the Market
Since the HP acquisition the number of times Vertica shows up as a competitor has actually dropped. I cannot explain this but HP has had a difficult time becoming a player in the data warehouse space and had several false starts (Neoview, Exadata, …). The product is sound and I hope that HP figures this out… but HP is primarily a server vendor and it will be difficult for them to sell Vertica and stay agnostic enough to also sell HANA, Oracle, Greenplum, and others.
The Amazon Redshift deal breathes life into Paraccel. They have to hope that the exposure provided by Amazon will turn into on-premise business for them. They are still a venture-funded small company who has to compete against bigger players with larger sales forces. It will be tough.
My Guess at the Future
I worry about Vertica in the long run.
Until the Amazon deal I would have guessed that Paraccel was done… again, not because their technology was bad… it is not… but because it was not good enough to create a company that could go public and there was no apparent buyer… no exit. The Amazon Redshift deal may provide an exit. We will see? Maybe Amazon can take this solid technology into the cloud and make it a winner?
When I was at Greenplum… and now again at SAP… I ran into a strange logic from Teradata about query concurrency. They claimed that query concurrency was a good thing and an indicator of excellent workload management. Let’s look at a simple picture of how that works.
In Figure 1 we depict a single query on a Teradata cluster. Since each node is working in parallel the picture is representative no matter how many nodes are attached. In the picture each line represents the time it takes to read a block from disk. To make the picture simple we will show I/O taking only 1/10th of the clock time… in the real world it is slower.
Given this simplification we can see that a single query can only consume 10% of the CPU… and the rest of the time the CPU is idle… waiting for work. We also represented some I/O to spool files… as Teradata writes all intermediate results to disk and then reads them in the next step. But this picture is a little unfair to Greenplum and HANA as I do not represent spool I/O completely. For each qualifying row the data is read from the table on disk, written to spool, and then read from spool in the subsequent step. But this note is about concurrency… so I simplified the picture.
Figure 2 shows the same query running on Greenplum. Note that Greenplum uses a data flow architecture that pushes tuples from step to step in the execution plan without writing them to disk. As a result the query completes very quickly after the last tuple is scanned from the table.
Let me say again… this story is about CPU utilization, concurrency, and workload management… I’m not trying to say that there are not optimizations that might make Teradata outperform Greenplum… or optimizations that might make Greenplum even faster still… I just want you to see the impact on concurrency of the spool architecture versus the data flow architecture.
Note that on Greenplum the processors are 20% busy in the interval that the query runs. For complex queries with lots of steps the data flow architecture provides an even more significant advantage to Greenplum. If there are 20 steps in the execution plan then Teradata will do spool I/O, first writing then reading the intermediate results while Greenplum manages all of the results in-memory after the initial reads.
In Figure 3 we see the impact of having the data in-memory as with HANA or TimeTen. Again, I am ignoring the implications of HANA’s columnar orientation and so forth… but you can clearly see the implications by removing block I/O.
Now let’s look at the same pictures with 2 concurrent queries. Let’s assume no workload management… just first in, first out.
In Figure 4 we see Teradata with two concurrent queries. Teradata has both queries executing at the same time. The second query is using up the wasted space made available while the CPUs wait for Query 1’s I/O to complete. Teradata spools the intermediate results to disk; which reduces the impact on memory while they wait. This is very wasteful as described here and here (in short, the Five Minute Rule suggests that data that will be reused right away is more economically stored in memory)… but Teradata carries a legacy from the days when memory was dear.
But to be sure… Teradata has two queries running concurrently. And the CPU is now 20% busy.
Figure 5 shows the two-query picture for Greenplum. Like Teradata, they use the gaps to do work and get both queries running concurrently. Greenplum uses the CPU much more efficiently and does not write and read to spool in between every step.
In Figure 6 we see HANA with two queries. Since one query consumed all of the CPU the second query waits… then blasts through. There is no concurrency… but the work is completed in a fraction of the time required by Teradata.
If we continue to add queries using these simple models we would get to the point where there is no CPU available on any architecture. At this point workload management comes into play. If there is no CPU then all that can be done is to either manage queries in a queue… letting them wait for resources to start… or start them and let them wastefully thrash in and out… there is really no other architectural option.
So using this very simple depiction eventually all three systems find themselves in the same spot… no CPU to spare. But there is much more to the topic and I’ve hinted about these in previous posts.
Starting more queries than you can service is wasteful. Queries have to swap in and out of memory and/or in and out of spool (more I/O!) and/or in and out of the processor caches. It is best to control concurrency… not embrace it.
Running virtual instances of the database instead of lightweight threads adds significant communications overhead. Instances often become unbalanced as the data returned makes the shards uneven. Since queries end when the slowest instance finishes it’s work this can reduce query performance. Each time you preempt a running query you have to restore state and repopulate the processor’s cache… which slows the query by 12X-20X. … Columnar storage helps… but if the data is decompressed too soon then the help is sub-optimal… and so on… all of the tricks used by databases and described in these blogs count.
But what does not count is query concurrency. When Teradata plays this card against Greenplum or HANA they are not talking architecture… it is silliness. Query throughput is what matters. Anyone would take a system that processes 100,000 queries per hour over a system that processes 50,000 queries per hour but lets them all run concurrently.
I’ve been picking on Teradata lately as they have been marketing hard… a little too hard. Teradata is a fine system and they should be proud of their architecture and their place in the market. I am proud to have worked for them. I’ll lay off for a while.
I recently pointed out some silliness published by Teradata to several SAP prospects. There is more nonsense that was sent and I’d like to take a moment to clear up these additional claims.
In their note to HANA prospects they used the following numbers from the paper SAP published here:
|# of Query Streams||
|# of Queries per Hour (Throughput)||
Teradata makes several claims from these numbers. First they claim that the numbers demonstrate a bottleneck that is tied to either the NUMA effect or to the SMP Knee Curve. This nonsense is the subject of a previous blog here.
For any database system as you increase the number of queries to the point where there is contention the throughput decreases. This is just common sense. If you have 10 cores and 10 threads and there is no contention then all threads run at the same speed as fast as possible. If you add an 11th thread then throughput falls off, as one thread has to wait for a core. As you add more threads the throughput falls further until the system is saturated and throughput flattens. Figure 1 is an example of the saturation curve you would expect from any system as the throughput flattens.
There are some funny twists to this, though. If you are an IMDB then each query can use 100% of a core. If you are multi-threaded IMDB then each query can use 100% of all cores. If you are a disk-based system then you give up the CPU to another query while you wait for I/O… so throughput falls. I’ll address these twists in a separate blog… but you will see a hint at the issue here.
Teradata claims that these numbers reflect a scaling issue. This is a very strange claim. Teradata tests scaling by adding hardware, data, and queries in equal amounts to see if the query performance holds constant… or they add hardware and data to look for a correlation between the number of nodes and query performance… hoping that as the nodes increase the response time decreases. In fact Teradata scales well… as does HANA… But the hardware is constant in the HANA benchmark so there is no view into scaling at all. Let me emphasis this… you cannot say anything about scaling from the numbers above.
Teradata claims that they can extrapolate the saturation point for the system… this represents very bad mathematics. They take the four data points in the table and create an S curve like the one in Figure 1… except they invert it to show how throughput decreases as you move towards the saturation point… Figure 2 shows the problem.
If you draw a straight line through the curve using any sort of math you miss the long tail at the end. This is an approximation of the picture Teradata drew… but even in their picture you can see a tail forming… which they ignore. It is also questionable math to extrapolate from only four observations. The bottom line is that you cannot extrapolate the saturation point from these four numbers… you just don’t know how far out the tail will run unless you measure it.
To prove this is nonsense you just have to look here. It turns out that SAP publicly published these benchmark results in two separate papers and this second one has numbers out to 60 streams. Unsurprisingly at 60 streams HANA processed 112,602 queries per hour while Teradata told their customers that it would saturate well short of that… at 49,601 queries (they predicted that HANA would thrash and the number of queries/hour would fall back… more FUD).
Teradata is sending propaganda to their prospects with scary extrapolations and pronouncements of architectural bottlenecks in HANA. The mathematics behind their numbers is weak and their incorrect use of deep architectural terms demonstrates ignorance of the concepts. They are trying to create Fear, Uncertainty, and Doubt. Bad marketing… not architecture, methinks.
Here is one I composed for SAP on the HANA blog about the recent Microsoft SQL Server announcements that is not too obnoxiously pro-HANA. It is more about the data architecture required to handle a world where the client is a mobile device and every query must complete sub-second. This, I believe is where we are headed… taking those BI queries that run in an hour on weak warehouses and improving the response to 10 seconds won’t cut it if your user is on a mobile device… and if the query is customer-facing you will be out of business…
The only way to solve for this is to get lots of silicon between you and your data… and hope that no queries miss the cache… or put it all in-memory.
I might have added to the work post that anytime a database vendor pre-announces a product that is due out in 1-2 years, ”2014-2015″ in this case, it is marketing not architecture… meant to freeze SQL Server customers in place while Microsoft tries to catch up.
Make sure to have a look at the comments… there is a great link to a Microsoft mouthpiece who suggests that I must have no technical background and that I am a liar. Nice.
One more surprise…
In the past SAP applications have, in general, avoided using database features. Even a SELECT with a projection was out-of-bounds. They did not want to depend on any database, so they tended to pull all data from the data layer to the application layer and loop through the data using procedural languages like ABAP. You might say that they were religiously database agnostic. My mistake… you might say that we were religiously database agnostic. I have to get used to these new surroundings.
Besides the obvious attributes of HANA: in-memory, shared-nothing, MPP, and column-oriented… the aim is to move the application logic next to the data and into HANA.
Any of you who have labored to convert procedural code into set-based SQL will understand the issue here. There are hundreds of thousands or millions of lines of procedural code… often very simple loops… that have to be converted to SQL to make the HANA architecture support the SAP application portfolio.
The surprise is not that there is this outstanding issue.. nor is it the ambitious architecture designed to push the application deep into the database (we are not talking about SQL-based stored procedures… we are talking about the application). The surprise is that the HANA development team has built a state-of-the-art facility that programmatically converts procedural logic into its set-based equivalent (not necessarily into SQL but sometimes into a language that can execute in-parallel). This is not a tool requiring manual intervention… it is an automatic, mathematically provable, transformation.
Right now the technique is used to covert logic in stored-procedures and in ABAP. But I hope to see it applied in the optimizer to convert those ugly Oracle cursor loops on-the-fly.
You can read more here.
By the way… SAP will continue to support ABAP using the database as a file server… moving all of the data from the database server to the application server for processing. But you can imagine that… when running applications that use this powerful capability… over time HANA will emerge with a huge performance advantage over other databases…
Oracle should be worried.
I’ve been trying to sort through the noise around Exalytics and see if there are any conclusions to be drawn from the architecture. But this post is more about the noise. The vast majority of the articles I’ve read posted by industry analysts suggest that Exalytics is Oracle‘s answer to SAP‘s HANA. See:
But I do not see it?
Exalytics is a smart cache that holds a redundant copy of aggregated data in memory to offload aggregate queries from your data warehouse or mart. The system is a shared-memory implementation that does not scale out as the size of the aggregates increase. It does scale up by daisy-chaining Exalytics boxes to store more aggregates. It is a read-only system that requires another DBMS as the source of the aggregated data. Exalytics provides a performance boost for Oracle including for Exadata (remember, Exadata performs aggregation in the RAC layer… when RAC is swamped Exalytics can offload some processing).
HANA is a fully functional in-memory shared-nothing columnar DBMS. It does not store a copy of the data.. it stores the data. It can be updated. HANA replaces Oracle… it does not speed it up.
I’ll post more on Exalytics… and on HANA… but there is no Exalytics vs. HANA competition ahead. There will be no Exalytics vs. HANA POCs. They are completely different technologies solving different problems with the only similarity being that they both leverage the decreasing costs of RAM to eliminate the expense of I/O to disk or SSD devices. Don’t let the common phrase “in-memory” confuse you.