In the post here I listed the units of parallelism (UoP) applied by various products on a single node. Those findings are summarized in the table below.
Cores per Node
UoP per Node
|Greenplum||DCA UAP Edition||
|Recommends 1 Segment for each 2 cores. Maybe some multi-threading per query so it could be greater than 8 on the average… and could be 16 with hyper-threads… but not more than 32 for sure.|
|Maybe only 12… cannot find if they use hyper-threads.|
|May use hyper-threads but limited by 16 FPGAs.|
|HANA||Any Xeon E7-4800||
A UoP is defined as the maximum number of instructions that can execute in parallel on a single node for a single query. Note that in the comments there was a lively debate where some readers wanted to count threads or processes or slices that were “active” but in a wait state. Since any program can start threads that wait I do not count these as UoP (later we might devise a new measure named units of waiting that would gauge the inefficiency in any given design by measuring the amount of waiting around required to keep the CPUs fed… maybe the measure would be valuable in measuring the inefficiency of the queue at your doctor’s office or at any government agency).
On some CPUs vendors such as Intel allow two threads to execute instructions in-parallel in a core. This is called hyper-threading and, if implemented, it allows for two UoP on a single core. Rather than constantly qualify the statements for the rest of this blog when I refer to cores I mean to imply hyper-threads.
The lively comments in the blog included some discussion of the sort of techniques used by vendors to try and keep the cores in the CPU on each node fed. It is these techniques that lead to more active I/O streams than cores and more threads than cores.
For several years now Intel and the other CPU manufacturers have been building ever more cores into their products. This has allowed them to continue the trend known as Moore’s Law. Multi-core is now a fact of life and even phones, tablets, and personal computers have multi-core chips.
But if you look at the table you can see that the database products above, even the newly announced products from Teradata and Netezza, are using CPUs with relatively few cores. The high-end Intel processors have 40 cores and the databases, with the exception of HANA, use Intel products with at most 16 cores. Further, Intel will deliver Ivy Bridge processors to the market this year with 120 cores. These vendors know this… yet they have chosen to deliver appliances with the previous generation CPUs. You might ask why?
I believe that there is an architectural reason for this (also a marketing reason covered here).
It is very hard to keep 80 cores fed with data when you have to perform block I/O. It will be nearly impossible to keep the 240 cores coming with Ivy Bridge fed. One solution is to deploy more nodes in a shared-nothing configuration with fewer cores per node… but this will be expensive requiring more power, floorspace, administration, etc. This is the solution taken by most of the vendors above. Another solution is to solve the problem without I/O with an in-memory database (IMDB) architecture. This is the solution taken by SAP with HANA.
Intel, IBM, and the rest will continue to build out using the multi-core approach for the foreseeable future. IMDB products will be able to fully utilize this product. Other products will struggle to take full advantage as we can see already… they will adapt and adjust and do what they can… but ultimately IMDB will win, I think… because there is just no other way to keep up as Moore’s Law continues to drive technology… no other way to feed the CPU engines with data fast enough.
If I am right then you will see more IMDB offerings from more vendors, including from the major vendors in the near future (note that this does not include the announcements of “database in memory” from Oracle which is not by any measure an in-memory database).
This is the underlying reason why Donald Feinberg (and Timo Elliott) are right on here. Every organization will be running in-memory… and soon.
Since my blogs tend to be in response to some stimulus they may not reflect a holistic view on any particular product. The “My 2 Cents” series will try to provide a broader view…
To help pay the bills please consider this as you read on…
OK, I hate Oracle marketing (see here and here). They are happy to skirt the edge of the credible too often. But let’s be real… Exadata was a very smart move… even if it a flawed product. The flaws are painful but not fatal… and Oracle can now play in the data warehouse space in places they could not play before. I do not believe that Exadata is a strong competitor as you will see below… it will not win many “fair” POCs… but the fight will be more than close enough to make customers with existing Oracle warehouses pick Exadata once they consider the cost of migration. This is tough… it means that customers are locked in to a relatively weak alternative… and every Oracle customer (and every Teradata customer and every SQL Server customer and every DB2 customer) should consider the long-term costs of vendor lock-in. But each customer has to weigh this for themselves… and this evaluation of the cost of lock-in is about neither architecture nor marketing…
Where They Win
First and foremost Exadata wins when there is an existing data warehouse or data mart on Oracle that will have to be migrated. My recommendation to customers is that they think about this carefully before they engage other vendors. It is a waste of everybody’s time to consider alternatives when in the end no alternative has a chance… and it is a double waste to do a POC when even a big technical win by a competitor cannot win them the business.
Exadata can win technically when the data “working set” is small. This allows Exadata to keep the hot data in SSD and in memory and better still, in the RAC layer. This allows Oracle to win POCs where that can suggest a subset of the EDW data is all that is required.
Exadata can win when the queries required, or tested, contain highly selective predicates that can be pushed down in the first steps of the explain plan. Conversely, Exadata bonks when lots of data must be pulled to the RAC layer to perform a join step.
Where They Lose
Everyone who has an Exadata system or who is considering one should view the two videos here. The architectural issues are apparent… and you can then consider the impact for your workload.
As noted above… in an Exadata execution plan the early simple table scans and projection are executed in the storage layer… subsequent steps occur in the RAC layer… if lots of data has to be moved up then the cluster chokes.
There are times when the architectural limitations are just too large and a migration is required to meet the response time requirements for the business. This often happens when Exadata is to support a single application rather than a data warehouse workload… In other words, if the cost of migrating away from Oracle is small, either because the applications to be moved are small or because an automated tool is available to mitigate the cosy or because the migration costs are subsidized by another source, then Exadata can lose even when there is a migration required.
Exadata can be beat on price… unless you count the cost of migration.
In the Market
For the reasons above, Exadata wins for current Oracle customers. There was a honeymoon when Exadata was winning some greenfield deals against other competitors… but these are now more rare.
My Guess at the Future
I think that the basic architecture of Exadata is defensible… having a split configuration is , after all, not completely foreign. Teradata and Greenplum and others use master nodes split from data nodes… and this is where is I predict we’ll see Oracle go. Over time, more execution steps will move to the storage layer and out of the RAC layer and in the end, Exadata will look ever more like a shared-nothing implementation. This just has to be the architectural way forward for Exadata (but don’t expect LE to stand up anytime soon and admit that he was wrong all of these years about the value of a shared-nothing architecture).
Phil has alerted us that there will be some OLTP/BI enhancements coming (see the comments section here)… which stole away a prediction I would have made otherwise.
The bottlenecks pointed out by Kevin Closson (as above and more here) need to be addressed… but to some extent these issues are the result of hardware constraints… and the combination of better hardware configurations and the push-down of more execution steps can mitigate many of the issues.
It will be a while before the Exadata architecture evolves to a point where the product is more competitive… and from now to then I think the World will be as I described it above… Oracle zealots will pick Exadata either as a religious stance or to avoid the cost of a migration… others will mostly go elsewhere…
Coming next… my 2 Cents on Netezza…
Teradata is circulating a document to customers that claims that the numbers SAP has published in its 100TB PoC white paper (here) demonstrates that HANA suffers from scaling issues associated with the NUMA-effect. The document is so annoyingly inaccurate that I have to respond.
NUMA stands for non-uniform-memory-access. This describes an architecture whereby each core in a multi-core system has some very fast local memory accessed directly through a memory bus… but has access to every other core’s local memory through a “remote” access hop over another fast bus. In the case of Intel Xeon servers the other fast bus is know as the QPI bus. “Non-uniform” means that all memory access are not equal… a remote access over the QPI bus is slower than access over the memory bus.
The first mistake in the Teradata document is where they refer to the problem as the “SMP Knee Curve”. SMP stands for symmetric multi-processing… an architecture where multiple cores share the same memory bus. The SMP Knee Curve describes the problem when too many cores are contending for the same bus. HANA is not certified to run on an SMP system. The 100TB PoC described above is not run on an SMP system. When describing issues you might expect Teradata to at least associate the issue with the correct hardware architecture.
The NUMA-effect describes problems scaling processors within a single NUMA node. Those issues can impact the ability to continuously add cores as memory locking issues across the QPI bus slow the system. There are ways to mitigate this problem, though (see here for some examples of how to code around the problem).
Of course HANA, which built an in-memory system with NUMA as a target from the start… has built in these NUMA mitigations. In fact, HANA is designed deeper still using special techniques to keep the processor caches filled and to invoke special-purpose SIMD instructions. HANA is built so close to the hardware that processor cycles that are unused due to cache misses but show up as processor busy are avoided (in other words, HANA will get more work done on a 100% CPU busy system than other software that will show 100% CPU busy). But Teradata chose to ignore this deep integration… or they were unaware of these techniques.
Worse still, the problem Teradata calls out… shouts out… is about scaling over 100 nodes in a shared-nothing configuration. The NUMA-effect has nothing at all to do with scale out across nodes. It is an issue within a single node. For Teradata to claim this is silliness at best. It is especially silly since the shared-nothing architecture upon which HANA is built is the same architecture Teradata uses.
The twists Teradata applies to the numbers are equally absurd… but I’ll stop here and hope that the lack of understanding they exhibit in throwing around terms like “SMP Knee Curve” and “NUMA-effect” will cast enough doubt that the rest of their marketing FUD will be suspect. Their document is surely not about architecture… it is weak marketing… you can see more here…
As you may have noticed I’m looking at in-memory databases (IMDB) these days… Here are some quick architectural observations on VMWare‘s SQLFire, Oracle’s Exalytics and TimesTen offerings, and SAP HANA.
It is worth noting up front that I am looking to see how these products might be used to build a generalized data mart or a data warehouse… In other words I am not looking to compare them for special case applications. This is important because each of these products has some extremely cool features that allow them to be applied to application-specific purposes with a narrow scope of data and queries… maybe in a later blog I can try to look at some narrow use-cases.
Further, to make this quick blog tractable I am going to assume that the mart/dw problem to be solved requires more data than can fit on one server node… and I am going to ignore features that let queries access data that resides on disk… in-memory or bust.
Finally I will assume that the SQL dialect supported is sufficient and not drill into details there. I will look at architecture not SQL features…
Simply put I am going to look at a three characteristics:
- Will the architecture support ad hoc queries?
- Does the architecture support scale-out?
- Can we say anything with regards to price/performance expectations?
Exalytics is a smart-aggregate store that sits over an Oracle database to offload aggregate query workload (see my previous post here or the Rittman Mead post here which declares: “Oracle Exalytics uses a specially enhanced version of Oracle TimesTen, Oracle’s in-memory database, to cache commonly used aggregates used in dashboards, analyses and other BI objects.” Exalytics does not support a scale-out shared-nothing architecture but it can scale up by adding nodes with new aggregate data. Queries access data within the aggregate structure and it is not possible to join to data off the Exalytics node… so ad hoc is out. Within these limits, which preclude Exalytics from being considered as a general platform for a mart or warehouse, Exalytics provides dictionary-based compression which should provide around 5X compression to reduce the amount of memory required and reduce the amount of hardware required.
TimesTen can do more. It is a general RDBMS. But it was designed for OLTP. I assume that the reason that Oracle has not rolled it out as a general-purpose data mart or data warehouse has to do with constraints that grow from those OLTP architectural roots. For example, BI queries run longer and require more data than a OLTP query… and even with data in-memory temporary storage is required for each query… and memory utilization is a product of the amount of data required and the amount of time the data has to inhabit memory… so BI queries put far more pressure on an in-memory DBMS. There are techniques to mitigate this… but you have to build the techniques in from the ground up.
I imagine that this is why TimesTen works for Exalytics, though. A OLAP query against a pre-aggregated cube does not graze an entire mart or warehouse. It is contained and “small data” (for my wacky take re: Exalytics and Exadata see here).
TimeTen is not sharded… so scalability is an issue. Oracle gets around this nicely by allowing you to partition data across instances and have the application route queries to the appropriate server. But this approach will not support joins across partitions so it severely limits scalability in a general-purpose mart or warehouse.
SQLFire is a very interesting new product built on top of Gemfire… and therefore mature from the start. SQLFire is more scalable than TimesTen/Exalytics. It supports sharded data in a cluster of servers. But SQLFire has the limitation that it cannot join data across shards (they call them partitions… see here) so it will be hard to support ad hoc queries… They provide the ability to replicate tables to support any sort of joins. If, for example, you replicate small dimension tables to coexist with sharded fact tables all joins are supported. This solution is problematic if you have multiple fact tables which must be joined… and replication of data uses more memory… but SQLFire has the foundation in place to become BI-capable over time.
Performance in an in-memory database comes first and foremost from eliminating disk I/O. All three IMDB product provide this capability. Then performance comes from the efficient use of compression. TimeTen incorporates Oracles dictionary-based “columnar” compression (I so hate this term… it is designed to make people think that Oracle products are sort-of columnar… but so far they are not). Then performance comes from columnar projection… the ability to avoid touching all data in a row to process a query. Neither TimesTen nor SQLFire are columnar databases. Then performance comes from parallel execution. Neither TimesTen nor SQLFire can involve all cores on a single query to my knowledge.
Price comes from compression as well. The more highly compressed the data is the less memory required to store it. Further, if data can be used without decompressing it, then less working memory is required. As noted, TimesTen has a compression capability. SQLFire does not appear to compress data. Neither can use compressed data. Note that 2X compression cuts the amout of memory/hardware required in half or more… 4X cuts it to a quarter… and so on. So this is significant.
Now for some transparency… I started the research for this blog, and composed a 1st draft, last Spring while I was at EMC Greenplum. I am now at SAP working with HANA. So… I will not go into HANA at great length… but I will point out that: HANA fully supports a shared-nothing architetcture… so it is fully scalable; HANA is fully parallel and able to use all cores for each query; HANA fully supports columnar tables so it provides deep compression and the ability to use the compressed data in execution. This is not remarkable as HANA was designed from the bottom up to support both BI and OLTP workloads while TimesTen and SQLFire started from a purely OLTP architectural foundation.
David Linthicum suggests here that Shadow IT is not all a bad thing. He references a PricewaterhouseCoopers study that suggests that 30% of all IT spending comes from the business directly… from outside of the IT budget.
In the data warehouse space we can confirm these numbers easily. Just google on “data mart consolidation” to see the impact of the business building their own BI infrastructure in order to get around the time-consuming strictures and bureaucratic processes that IT imposes on a classic EDW platform. Readers… think of the term “data governance”… governance implies bureaucracy. And a “single version of the truth” implies a monopoly (governed by IT). We need a market for ideas to support our business intelligence… and a market is a little chaotic.
What we need is a place where IT says to the business… we cannot get you integrated into our formal EDW infrastructure as fast as you would like… but don’t go and build your own warehouse/mart on your own shadow platform. Let us provide you with a mart in the cloud. Take the data you need from our EDW. Enhance it as you see fit. We can spin up a server to house the mart in the cloud in a couple of hours. Let us help you. Use the tools you want… we think that it is cool that you are going to try out some new stuff… but if you want to use the tools we provide then you’ll get the benefit of our licensing deal and the benefit of our support… but you decide. We need IT to allow a little chaos…
This, I believe is what cloud offers to the data warehouse space…. the platform to respond.
But there is a rub… data warehouse appliances from Teradata, Exadata, and Netezza require bundled hardware that is not going to fit in your cloud. A shared-nothing architecture is a tough fit into the shared disk paradigm of the cloud (see here). The I/O reliance of a disk-based DBMS make performance tough on a shared disk platform. I think that for data marts and analytic sandboxes the cloud is the right choice… if you want to minimize the size of the shadow IT cast by lines of business. An in-memory database (IMDB): HANA, TimesTen, or SQLFire may be the best alternative for a small cloud-based mart.
David Linthicum has it right in spades for the data warehouse space… we need some user pull-through… and we need cloud computing as the platform to make these user-driven initiatives manageable.
One more surprise…
In the past SAP applications have, in general, avoided using database features. Even a SELECT with a projection was out-of-bounds. They did not want to depend on any database, so they tended to pull all data from the data layer to the application layer and loop through the data using procedural languages like ABAP. You might say that they were religiously database agnostic. My mistake… you might say that we were religiously database agnostic. I have to get used to these new surroundings.
Besides the obvious attributes of HANA: in-memory, shared-nothing, MPP, and column-oriented… the aim is to move the application logic next to the data and into HANA.
Any of you who have labored to convert procedural code into set-based SQL will understand the issue here. There are hundreds of thousands or millions of lines of procedural code… often very simple loops… that have to be converted to SQL to make the HANA architecture support the SAP application portfolio.
The surprise is not that there is this outstanding issue.. nor is it the ambitious architecture designed to push the application deep into the database (we are not talking about SQL-based stored procedures… we are talking about the application). The surprise is that the HANA development team has built a state-of-the-art facility that programmatically converts procedural logic into its set-based equivalent (not necessarily into SQL but sometimes into a language that can execute in-parallel). This is not a tool requiring manual intervention… it is an automatic, mathematically provable, transformation.
Right now the technique is used to covert logic in stored-procedures and in ABAP. But I hope to see it applied in the optimizer to convert those ugly Oracle cursor loops on-the-fly.
You can read more here.
By the way… SAP will continue to support ABAP using the database as a file server… moving all of the data from the database server to the application server for processing. But you can imagine that… when running applications that use this powerful capability… over time HANA will emerge with a huge performance advantage over other databases…
Oracle should be worried.
I would like to point you to two articles in the latest Northern California Oracle Users Group (NoCOUG) Journal here.
The first is an interview of Kevin Closson here. The interview is long and will take some time to get through… so set aside 30 minutes… it will be worth it as Kevin discusses Exadata, shared-nothingness, and other topics related to database hardware architecture.
The second article I would like to suggest (by the way there are several other excellent articles) is by Dr. Bert Scalzo. He reminds us that our job as engineers is to build the most cost-effective solution… not to build the perfect solution. He suggests that hardware should be treated as a dynamic resource that can be provisioned easily to solve performance problems.
I have argued that in a shared-nothing, scalable, architecture it is often cheaper to add another $20,000 fat server than to spend $100,000 of staff time to tune around a performance problem. This is especially true when the tuning involves building indexes and materialized views or pre-aggregated tables that make your warehouse fragile and more difficult to tune the next time. See here…
Back to Kevin’s interview and to tie the two articles together… Kevin suggests that as long as data flows into the CPUs fast enough then there is no reason to pick a shared-nothing architecture over a shared-everything architecture. He insists on symmetry and rightfully points out that a shared-everything system can be symmetrical. But it is more difficult to maintain symmetry as you scale up a shared-everything system… and easy scale is what is required to treat hardware as a dynamic resource. So… I remain convinced that shared-nothing is the way to go…
I have promised not to promote HANA heavily on this site… and I will keep that promise. But I want to share something with you about the HANA architecture that is not part of the normal marketing in-memory database (IMDB) message: HANA is parallel from its foundation.
What I mean by that is that when a query is executed in-memory HANA dynamically shards the data in-memory and lets each core start a thread to work on its shard.
Other shared-nothing implementations like Teradata and Greenplum, which are not built on a native parallel architecture, start multiple instances of the database to take advantage of multiple cores. If they can start an instance-per-core then they approximate the parallelism of a native implementation… at the cost of inter-instance communication. Oracle, to my knowledge, does not parallelize steps within a single instance… I could be wrong there so I’ll ask my readers to help?
As you would expect, for analytics and complex queries this architecture provides a distinct advantage. HANA customers are optimizing price models sub-second in-real-time with each quote instead of executing a once-a-week 12-hour modeling job.
As you would expect HANA cannot yet stretch into the petabyte range. The current HANA sweet spot is for warehouses or marts is in the sub-TB to 20TB range.
In the previous post here I suggested that a SAN-based, cloudy, EDW is about 4X the cost for the same performance over a data warehouse appliance.. and I described why. I have actually seen this comparison.
It is difficult to compare Amazon EC2 hardware to the hardware typically assembled in a shared-nothing EDW cluster whether the hardware is from HP, Dell, Sun, IBM, or Teradata. So let’s assume that Amazon gets a 20% edge due to huge volume purchases over your firm. Note that this is a significant edge since the hardware is a commodity. Further, lets assume that Amazon gets another 30% edge in TCO on system administration costs. This is the cost of staff to manage the Linux OS and the hardware components. This may also be generous to the Amazon side of the equation. The numbers are not important… you can put in whatever seems to model your situation best… if you work for a large efficient company the numbers may go down for EC2.
Lets also assume that you reserve and receive dedicated hardware on EC2. This will not be the case but lets continue to build a best-case scenario for EC2.
From these numbers we can assume that the EC2 configuration will be 3X the cost for the same performance as a dedicated purpose-built database cluster. Again this assumes that the EC2 hardware is dedicated so this number is optimistic.
So why would anyone do this? Because EC2 has no up-front capital expense associated… it is an operating expense. This is significant.
So what is the advantage of buying ParAccel on EC2? I’m unsure. ParAccel has not done particularly well in the marketplace… but it is not clear that this is a technology issue. The answer could lie in the fact that companies deploy ParAccel on EC2 for data mart or application-specific workloads that may not use 100% of the hardware resources provided?
I think that if you work through these three blogs you can get an idea of how to model the opportunity for yourself. If the ability to spend OPEX dollars with Amazon is important… even if you need 3X the hardware… then this is a very interesting way to go.
But do not imagine that you are getting the same performance with ParAccel on EC2 that you wold get with ParAccel on HP or Dell… for a fraction of the price. There is no architectural advantage in ParAccel on EC2 over Vertica or Greenplum or any other DBMS that can run on EC2… ParAccel is, however, trying something new and interesting… if you understand the trade-offs.
In the last blog of this series I’ll discuss some new approaches that may change the game… including another interesting possibility for ParAccel going forward.
In Part 1 of this topic (here) I suggested that cloud computing has the ability to be elastic… to expand and maybe contract the infrastructure as CPU, memory, or storage requirements change. I also suggested that the workload on an EDW is intense and static to point out that there was no significant advantage to consolidating non-database workloads onto an over utilized EDW platform.
But EDW workload does flex some with the business cycle… quarter end reporting is additive to the regular daily workload. So maybe an elastic stretch to add resources and then a contraction has value? It most probably does add value.
The reason shared-nothing works is because it builds on a sharded model that splits the data across nodes and lets the CPU and I/O bandwidth scale together. This is very important… the limiting factor in these days of multi-core CPUs is I/O bandwidth and many nodes plus shards provides the aggregate I/O bandwidth of all disk controllers in the cluster.
What does that mean with regards to building an elastic data warehouse? It means that with each elastic stretch the data has to be re-deployed across the new number of shards. And because the data to be moved is embedded in blocks it means that the entire warehouse, every block, has to be scanned and re-written. This is an expensive undertaking on disk… one that bottlenecks at the disk controller and one that bottlenecks worse if there are fewer controllers (for example in in a SAN environment). Then, when the configuration is to shrink it process is repeated. In reality the cost of th I/Oe resources to expand and contract does not justify the benefit.
So… we conclude that while it is technically possible to build an elastic EDW it is not really optimal. In every case it is feasible to build a cloud-based EDW… it is possible to deploy a shared-nothing architecture, possible to consolidate workloads, and possible to expand and contract… but it is sub-optimal.
The real measure of this is that in no case would a cloud-based EDW proof-of-concept win business over a stand-alone cluster. The price of the cloudy EDW would be 2X for 1/2 the performance… and it is unlikely that the savings associated with cloud computing could make up this difference (the price of SAN is 2X that of JBOD and the aggregate I/O bandwidth is 1/2… for the same number of servers… hence the rough estimates). This is why EMC offers a Data Computing Appliance without a SAN. Further, this 4X advantage assumes that 100% of the SAN-based cluster is dedicated to the EDW. If 50% of the cluster is shared with some other workload then the performance drops by that 50%.