In the post here I listed the units of parallelism (UoP) applied by various products on a single node. Those findings are summarized in the table below.
Cores per Node
UoP per Node
|Greenplum||DCA UAP Edition||
|Recommends 1 Segment for each 2 cores. Maybe some multi-threading per query so it could be greater than 8 on the average… and could be 16 with hyper-threads… but not more than 32 for sure.|
|Maybe only 12… cannot find if they use hyper-threads.|
|May use hyper-threads but limited by 16 FPGAs.|
|HANA||Any Xeon E7-4800||
A UoP is defined as the maximum number of instructions that can execute in parallel on a single node for a single query. Note that in the comments there was a lively debate where some readers wanted to count threads or processes or slices that were “active” but in a wait state. Since any program can start threads that wait I do not count these as UoP (later we might devise a new measure named units of waiting that would gauge the inefficiency in any given design by measuring the amount of waiting around required to keep the CPUs fed… maybe the measure would be valuable in measuring the inefficiency of the queue at your doctor’s office or at any government agency).
On some CPUs, vendors such as Intel allow two threads to execute instructions in parallel on a single core. This is called hyper-threading and, if implemented, it allows for two UoP on a single core. Rather than constantly qualify the statements for the rest of this blog, when I refer to cores I mean to imply hyper-threads.
The lively comments in the blog included some discussion of the sort of techniques used by vendors to try and keep the cores in the CPU on each node fed. It is these techniques that lead to more active I/O streams than cores and more threads than cores.
For several years now Intel and the other CPU manufacturers have been building ever more cores into their products. This has allowed them to continue the trend known as Moore’s Law. Multi-core is now a fact of life and even phones, tablets, and personal computers have multi-core chips.
But if you look at the table you can see that the database products above, even the newly announced products from Teradata and Netezza, are using CPUs with relatively few cores. The high-end Intel processors have 40 cores and the databases, with the exception of HANA, use Intel products with at most 16 cores. Further, Intel will deliver Ivy Bridge processors to the market this year with 120 cores. These vendors know this… yet they have chosen to deliver appliances with the previous generation CPUs. You might ask why?
I believe that there is an architectural reason for this (also a marketing reason covered here).
It is very hard to keep 80 cores (40 physical cores, each running two hyper-threads) fed with data when you have to perform block I/O. It will be nearly impossible to keep the 240 cores (120 physical cores) coming with Ivy Bridge fed. One solution is to deploy more nodes in a shared-nothing configuration with fewer cores per node… but this will be expensive, requiring more power, floorspace, administration, etc. This is the solution taken by most of the vendors above. Another solution is to solve the problem without I/O with an in-memory database (IMDB) architecture. This is the solution taken by SAP with HANA.
Intel, IBM, and the rest will continue to build out using the multi-core approach for the foreseeable future. IMDB products will be able to fully utilize this product. Other products will struggle to take full advantage as we can see already… they will adapt and adjust and do what they can… but ultimately IMDB will win, I think… because there is just no other way to keep up as Moore’s Law continues to drive technology… no other way to feed the CPU engines with data fast enough.
If I am right then you will see more IMDB offerings from more vendors, including from the major vendors in the near future (note that this does not include the announcements of “database in memory” from Oracle which is not by any measure an in-memory database).
This is the underlying reason why Donald Feinberg (and Timo Elliott) are right on here. Every organization will be running in-memory… and soon.
6 May… There is a summary of this post and of the comments here. - Rob
17 April… A single unit of parallelism is a core plus a thread/process to feed it instructions plus a feed of data. The only exception is when the core uses hyper-threading… in which case 2 instructions can execute more-or-less at the same time… then a core provides 2 units of parallelism. All of the other stuff: many threads per core and many data shards/slices per thread are just techniques to keep the core fed. – Rob
16 April… I edited this to correct my loose use of the word “shard”. A shard is a physical slice of data and I was using it to represent a unit of parallelism. – Rob
I made the observation in this post that there is some inefficiency in an architecture that builds parallel streams that communicate on a single node across operating system boundaries… and these inefficiencies can limit the number of parallel streams that can be deployed. Greenplum, for example, no longer recommends deploying a segment instance per core on a single node and as a result not all of the available CPU can be applied to each query.
This blog will outline some other interesting limits on the level of parallelism in several products and on the definition of Massively Parallel Processing (MPP). Note that the level of parallelism is directly associated with performance.
Exadata deploys 12 cores per cell/node in the storage subsystem. They deploy 12 disk drives per node. I cannot see it clearly documented how many threads they deploy per disk… but it could not be more than 24 units of parallelism if they use hyper-threading of some sort. It may well be that there are only 12 units of parallelism per node (see here).
Updated April 16: Netezza deploys 8 “slices” per S-Blade… 8 units of parallelism… one for each FPGA core in the Twin times four (2X4) TwinFin architecture (see here). The next generation Netezza Striper will have 16-way parallelism per node with 16 Intel cores and 16 FPGA cores…
Updated April 17: Teradata uses hyper-threading (see here)… so that they will deploy 24 units of parallelism per node on an EDW 6700C (2X6X2) and 32 units of parallelism per node on an EDW 6700H (2X8X2).
You can see the different definitions of the word “massive” in these various parallel processing systems.
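The per-node figures quoted in this post are simple products of sockets, cores per socket, and threads per core. Here is a minimal sketch of that arithmetic. The function name `units_of_parallelism` is hypothetical and used only for illustration, and the layouts are just the (2X6X2), (2X8X2), and 8X15 numbers from this post taken at face value:

```python
def units_of_parallelism(sockets, cores_per_socket, hyper_threading=False):
    """Upper bound on instructions executing in parallel on one node.

    Hypothetical helper: one UoP per core, or two per core when the
    core runs two hyper-threads.
    """
    threads_per_core = 2 if hyper_threading else 1
    return sockets * cores_per_socket * threads_per_core


# Layouts quoted in the post: sockets X cores X hyper-threads.
teradata_6700c = units_of_parallelism(2, 6, hyper_threading=True)
teradata_6700h = units_of_parallelism(2, 8, hyper_threading=True)

# The 8X15 "fat node" on the next generation of Xeon processors.
fat_node_cores = units_of_parallelism(8, 15)
fat_node_uop = units_of_parallelism(8, 15, hyper_threading=True)

print(teradata_6700c, teradata_6700h, fat_node_cores, fat_node_uop)
```

Note that this only counts what the hardware can execute. Threads, processes, or slices that are started but sitting in a wait state do not add to the number.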
Note that the next generation of Xeon processors coming out later this year will have 8X15 processors or 120 cores on a fat node:
- This will provide HANA with the ability to deploy 240 units of parallelism per node.
- Netezza will have to find a way to scale up the FPGA cores per S-Blade to keep up. TwinFin will have to become QuadFin or DozenFin. It became HexadecaFin… see above. – Rob
- Exadata will have to put 120 SSD/disk drive combos in each node instead of 12 if they want to maintain the same parallelism-to-disk ratio with 120 units of parallelism.
- Teradata will have to find a way to get more I/O bandwidth on the problem if they want to deploy nodes with 120+ units of parallelism per node.
Most likely all but HANA will deploy more nodes with a smaller number of cores and pay the price of more servers, more power, more floor space, and inefficient inter-node network communications.
So stay tuned…
Since my blogs tend to be in response to some stimulus they may not reflect a holistic view on any particular product. The “My 2 Cents” series will try to provide a broader view…
To help pay the bills please consider this as you read on…
OK, I hate Oracle marketing (see here and here). They are happy to skirt the edge of the credible too often. But let’s be real… Exadata was a very smart move… even if it is a flawed product. The flaws are painful but not fatal… and Oracle can now play in the data warehouse space in places they could not play before. I do not believe that Exadata is a strong competitor as you will see below… it will not win many “fair” POCs… but the fight will be more than close enough to make customers with existing Oracle warehouses pick Exadata once they consider the cost of migration. This is tough… it means that customers are locked in to a relatively weak alternative… and every Oracle customer (and every Teradata customer and every SQL Server customer and every DB2 customer) should consider the long-term costs of vendor lock-in. But each customer has to weigh this for themselves… and this evaluation of the cost of lock-in is about neither architecture nor marketing…
Where They Win
First and foremost Exadata wins when there is an existing data warehouse or data mart on Oracle that will have to be migrated. My recommendation to customers is that they think about this carefully before they engage other vendors. It is a waste of everybody’s time to consider alternatives when in the end no alternative has a chance… and it is a double waste to do a POC when even a big technical win by a competitor cannot win them the business.
Exadata can win technically when the data “working set” is small. This allows Exadata to keep the hot data in SSD and in memory and, better still, in the RAC layer. This allows Oracle to win POCs where they can suggest that a subset of the EDW data is all that is required.
Exadata can win when the queries required, or tested, contain highly selective predicates that can be pushed down in the first steps of the explain plan. Conversely, Exadata bonks when lots of data must be pulled to the RAC layer to perform a join step.
Where They Lose
Everyone who has an Exadata system or who is considering one should view the two videos here. The architectural issues are apparent… and you can then consider the impact for your workload.
As noted above… in an Exadata execution plan the early simple table scans and projection are executed in the storage layer… subsequent steps occur in the RAC layer… if lots of data has to be moved up then the cluster chokes.
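The trade-off can be sketched in a few lines. This is only an illustration under stated assumptions, not Oracle's implementation: `storage_scan` and `compute_join` are hypothetical names standing in for the storage layer and the RAC layer, and the tables are made-up rows. The point is that the number of rows shipped between the layers depends on how selective the pushed-down predicate is.

```python
def storage_scan(rows, predicate, columns):
    """Storage layer (hypothetical): filter and project, ship only what qualifies."""
    return [{c: r[c] for c in columns} for r in rows if predicate(r)]


def compute_join(left, right, key):
    """Upper layer (hypothetical): hash join over the rows shipped from storage."""
    index = {}
    for r in right:
        index.setdefault(r[key], []).append(r)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


# Made-up data, for illustration only.
orders = [{"order_id": i, "cust_id": i % 100, "amount": i * 10}
          for i in range(10_000)]
customers = [{"cust_id": i, "region": "west" if i % 2 else "east"}
             for i in range(100)]

# A highly selective predicate: few rows move up to the join step.
selective = storage_scan(orders, lambda r: r["cust_id"] == 7,
                         ["order_id", "cust_id"])

# A predicate that filters nothing: the whole table moves up.
unselective = storage_scan(orders, lambda r: r["amount"] >= 0,
                           ["order_id", "cust_id"])

all_customers = storage_scan(customers, lambda r: True, ["cust_id", "region"])
joined = compute_join(selective, all_customers, "cust_id")

print(len(selective), len(unselective), len(joined))
```

In the selective case only one percent of the rows cross between the layers. In the unselective case every row does, and that is the situation where the upper layer becomes the bottleneck.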
There are times when the architectural limitations are just too large and a migration is required to meet the response time requirements for the business. This often happens when Exadata is to support a single application rather than a data warehouse workload… In other words, if the cost of migrating away from Oracle is small, either because the applications to be moved are small or because an automated tool is available to mitigate the cost or because the migration costs are subsidized by another source, then Exadata can lose even when there is a migration required.
Exadata can be beat on price… unless you count the cost of migration.
In the Market
For the reasons above, Exadata wins for current Oracle customers. There was a honeymoon when Exadata was winning some greenfield deals against other competitors… but these are now more rare.
My Guess at the Future
I think that the basic architecture of Exadata is defensible… having a split configuration is, after all, not completely foreign. Teradata and Greenplum and others use master nodes split from data nodes… and this is where I predict we’ll see Oracle go. Over time, more execution steps will move to the storage layer and out of the RAC layer and in the end, Exadata will look ever more like a shared-nothing implementation. This just has to be the architectural way forward for Exadata (but don’t expect LE to stand up anytime soon and admit that he was wrong all of these years about the value of a shared-nothing architecture).
Phil has alerted us that there will be some OLTP/BI enhancements coming (see the comments section here)… which stole away a prediction I would have made otherwise.
The bottlenecks pointed out by Kevin Closson (as above and more here) need to be addressed… but to some extent these issues are the result of hardware constraints… and the combination of better hardware configurations and the push-down of more execution steps can mitigate many of the issues.
It will be a while before the Exadata architecture evolves to a point where the product is more competitive… and from now to then I think the World will be as I described it above… Oracle zealots will pick Exadata either as a religious stance or to avoid the cost of a migration… others will mostly go elsewhere…
Coming next… my 2 Cents on Netezza…
There seems to be a sort of odd tradition for bloggers to look back at the past year as the New Year starts to unfold. Here is my review of my posts and some presents.
Far and away the most viewed post was Exalytics vs. HANA What are they thinking? This simply notes that these two products are not really comparable sharing only the descriptor “in-memory”.
My Favorite Post
I liked this the best… ’nuff said: What is Big Data?
OK, here is my 2nd favorite: A Quick Five Minute Rule Update for In-memory Databases, but you probably need to read the prequel first: The Five Minute Rule and In-memory Databases
These papers and the underlying thinking by smarter folks than I will inform you about the definition of Hot Data from the point of pure IT economics.
The Most Under-rated Post
This is the post I thought was the most important… as it might strongly influence data warehouse platform buying decisions over the next few years… And it might even influence the stocks you pick: The Future of Hadoop and Big Data DBMSs
Some Other Posts to Read
Here are two posts that informed me:
The Five Minute Rule… This will point you to a Wikipedia article that will point you to the whole series of papers.
What Every Programmer Should Know About Memory… This paper goes into gory detail about how memory works inside a processor. It is hardware-centric for you software folks… but provides the basis for understanding why in-memory DBMSs are fast and why Exadata is not an in-memory DBMS.
And some other Good Stuff
Kevin Closson on Exadata
Thank you for your attention last year. I hope that each of you has a safe, prosperous, and happy new year…
I posted a blog on the SAP site here that discussed the implications of mobile clients. I want to re-emphasize the issue as it is crucial.
While at Greenplum we routinely replaced older EDW platforms and provided stunning performance. I recall one customer in particular where we were given a query that ran in 7 hours and Greenplum executed the query in seven seconds. This was exceptional… more typical were cases where we reduced run-times from several hours to under 30 minutes… to 10 minutes… to 5 minutes. I’m sure that every major competitor: Teradata, Greenplum, Netezza, and Exadata has similar stories to tell.
But 5 minutes will not cut it if you are servicing a mobile client where sub-second response to the device is a requirement… and 10 minutes is out of the question. It does not matter if it ran in 10 hours before… 10 minute response is not acceptable to a mobile device.
Today we see sub-second response delivered to our phones by custom applications built on special high-performance platforms designed specifically to service a mobile client: iPhones, iPads, and Android devices.
But what will we do about the BI applications built on commercial platforms which have just used every trick in the book to become one of the 5 minute stories mentioned above?
I think that there are only a couple of architectural choices.
- We can rewrite the high-value queries as custom applications using specialized infrastructure… at great expense… and leaving the vast majority of queries un-serviced.
- We can apply the 80/20 rule to get the easiest queries serviced with only 20% of the effort. But according to Murphy the 20% left will be the highest value queries.
- We can tack on expensive, specialized, accelerators to some queries… to those that can be accelerated… but again we leave too much behind.
- Or we can move to a general purpose high performance computing platform that can service the existing BI workload with sub-second response.
In-memory computing will play a role… Exalytics provides option #3… HANA option #4.
SSD devices may play a role… but the performance improvements being quoted by vendors who use SSD as a block I/O device are 10X or less. A 10X improvement applied to a query that was just improved to 10 minutes yields a 1 minute query… still not the expected level of service.
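The arithmetic is worth writing down. This is a small sketch with hypothetical helper names, and the 1-second target is my stand-in for “sub-second” rather than a measured requirement:

```python
def runtime_after_speedup(runtime_seconds, speedup):
    """Run time once a uniform speedup factor is applied."""
    return runtime_seconds / speedup


def speedup_needed(runtime_seconds, target_seconds):
    """Speedup factor required to bring a run time down to the target."""
    return runtime_seconds / target_seconds


ten_minutes = 10 * 60

# The 10X quoted for SSD used as a block I/O device.
with_ssd = runtime_after_speedup(ten_minutes, 10)

# What it would take to answer a mobile client (assumed 1-second target).
needed_for_mobile = speedup_needed(ten_minutes, 1.0)

# The exceptional 7 hours to 7 seconds case, for comparison.
exceptional_case = speedup_needed(7 * 60 * 60, 7)

print(with_ssd, needed_for_mobile, exceptional_case)
```

A 10X gain leaves the query at 60 seconds, while a 1-second response on the same query needs something on the order of 600X.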
IT departments will have to evaluate the price/performance, not just the price, as they consider their next platform purchases. The definition of adequate response is changing… and the old adequate, at the least cost, may not cut it. Mobile clients are here to stay. The productivity gains expected from these devices are significant. High performance BI computing is going to be a requirement.
Here is an attempt to build a Price/Performance model for several data warehouse databases.
Added on February 21, 2013: This attempt is very rough… very crude… and a little too ambitious. Please do not take it too literally. In the real world Greenplum and Teradata will match or exceed the price/performance of Exadata… and the fact that the model does not show this exposes the limitations of the approach… but hopefully it will get you thinking… – Rob
For price I used some $$/Terabyte numbers scattered around the internet. They are not perfect but they are close enough to make the model interesting. I used:
Of these numbers the one that may be the furthest off is the HANA number. This is odd since I work for SAP… but I just could not find a good number so I picked a big number to see how the model came out. Please, for any of these numbers provide a comment and I’ll adjust.
For each product I used the high performance product rather than the product with large capacity disks…
I used latency as a stand-in for performance. This is not perfect either… but it is not too bad. I’ll try again some other time and add data transfer time to the model. Note that I did not try to account for advantages and disadvantages that come from the software… so the latency associated with I/O to spool/work files is not counted… use of indexes and/or column store is not counted… compression is not counted. I’ll account for some of this when I add in transfer times.
I did try to account for cache hits when there is SSD cache in the configuration… but I did not give HANA credit for the work done to get most data from the processor caches instead of from DRAM.
For network latency I just assumed one round trip for each product…
For latencies I used the picture below:
The exception is that for products that use PCIe to access SSDs I cut the latency by 1/3 based on some input from a vendor. I could not find details on the latency for Teradata’s Bynet so I assumed that it is comparable with Infiniband and the newest 10GigE switches.
Here is what I came up with:
|HANA (2 nodes)||
I suppose that if a model seems to reflect reality then it is useful?
HANA has the lowest latency because it is in-memory. When there are two nodes a penalty is paid for crossing the network… this makes sense.
Exadata does well because the X3 product has SSD cache and I assumed an 80% hit ratio.
Teradata does a little worse because I assumed a lower hit ratio (they have less SSD per TB of data).
Greenplum does worse as they do all I/O against disks.
Note the penalty paid whenever you have to go to disk.
Let me say again… this model ignores lots of software features that would affect performance… but it is pretty interesting as a start…
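For readers who want to play with the idea, here is a minimal sketch of this kind of model: a cache-hit-weighted storage latency plus one network round trip. The latency constants are placeholders I am assuming purely for illustration (they are not the figures from the picture), the 50% hit ratio is likewise assumed, and the scenario names are generic rather than product results. Only the 80% hit ratio comes from the text above.

```python
def expected_latency_us(hit_ratio, cache_latency_us, miss_latency_us,
                        network_round_trips=1, network_latency_us=0.0):
    """Expected access latency: weighted cache/miss cost plus network hops."""
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    storage = (hit_ratio * cache_latency_us
               + (1.0 - hit_ratio) * miss_latency_us)
    return storage + network_round_trips * network_latency_us


# Placeholder latencies in microseconds, ASSUMED for illustration only.
DRAM_US = 0.1
SSD_US = 100.0
DISK_US = 10_000.0
NETWORK_US = 10.0

scenarios = {
    "in-memory, one node": expected_latency_us(1.0, DRAM_US, DRAM_US, 0),
    "in-memory, two nodes": expected_latency_us(1.0, DRAM_US, DRAM_US,
                                                1, NETWORK_US),
    "SSD cache, 80% hits": expected_latency_us(0.8, SSD_US, DISK_US,
                                               1, NETWORK_US),
    "SSD cache, 50% hits": expected_latency_us(0.5, SSD_US, DISK_US,
                                               1, NETWORK_US),
    "disk only": expected_latency_us(0.0, SSD_US, DISK_US, 1, NETWORK_US),
}

for name, latency in scenarios.items():
    print(f"{name}: {latency:,.1f} microseconds")
```

Even with made-up constants the shape of the result is the same: the miss penalty dominates, so a lower hit ratio costs far more than the network round trip does.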
Wikipedia defines computer memory as:
In computing, memory refers to the physical devices used to store programs (sequences of instructions) or data (e.g. program state information) on a temporary or permanent basis for use in a computer or other digital electronic device. The term primary memory is used for the information in physical systems which are fast (i.e. RAM), as a distinction from secondary memory, which are physical devices for program and data storage which are slow to access but offer higher memory capacity. Primary memory stored on secondary memory is called “virtual memory”.
The term “storage” is often (but not always) used in separate computers of traditional secondary memory such as tape, magnetic disks and optical discs (CD-ROM and DVD-ROM). The term “memory” is often (but not always) associated with addressable semiconductor memory, i.e. integrated circuits consisting of silicon-based transistors, used for example as primary memory but also other purposes in computers and other digital electronic devices.
To a computer program like a DBMS, memory is a resource allocated using commands like malloc() and calloc(). Note that these commands allocate primary memory using the definition above. From this you should conclude that an in-memory DBMS (IMDB) is a system that puts all of its data into memory allocated by the database program.
In their announcements this week Oracle states (here) that Exadata 3 is an in-memory database machine and Larry Ellison said, “Everything is in memory. All of your databases are in-memory. You virtually never use your disk drives. Disk drives are becoming passe. They’re good at storing images and a lot of data we don’t access very often.”
But their definition of in-memory includes SSD devices that are not directly addressable by the DBMS. In fact they use 22TB of SSDs and 4TB of DRAM. The SSDs are a cache sitting between the DBMS and disk storage. They are storage according to Wikipedia.
Exadata 3 is not an in-memory database machine. It takes more than lots of hardware to make a DBMS an in-memory DBMS.
Oracle is spewing marketing, not architecture.
David Linthicum suggests here that Shadow IT is not all a bad thing. He references a PricewaterhouseCoopers study that suggests that 30% of all IT spending comes from the business directly… from outside of the IT budget.
In the data warehouse space we can confirm these numbers easily. Just google on “data mart consolidation” to see the impact of the business building their own BI infrastructure in order to get around the time-consuming strictures and bureaucratic processes that IT imposes on a classic EDW platform. Readers… think of the term “data governance”… governance implies bureaucracy. And a “single version of the truth” implies a monopoly (governed by IT). We need a market for ideas to support our business intelligence… and a market is a little chaotic.
What we need is a place where IT says to the business… we cannot get you integrated into our formal EDW infrastructure as fast as you would like… but don’t go and build your own warehouse/mart on your own shadow platform. Let us provide you with a mart in the cloud. Take the data you need from our EDW. Enhance it as you see fit. We can spin up a server to house the mart in the cloud in a couple of hours. Let us help you. Use the tools you want… we think that it is cool that you are going to try out some new stuff… but if you want to use the tools we provide then you’ll get the benefit of our licensing deal and the benefit of our support… but you decide. We need IT to allow a little chaos…
This, I believe, is what the cloud offers to the data warehouse space… the platform to respond.
But there is a rub… data warehouse appliances from Teradata, Exadata, and Netezza require bundled hardware that is not going to fit in your cloud. A shared-nothing architecture is a tough fit into the shared disk paradigm of the cloud (see here). The I/O reliance of a disk-based DBMS makes performance tough on a shared disk platform. I think that for data marts and analytic sandboxes the cloud is the right choice… if you want to minimize the size of the shadow IT cast by lines of business. An in-memory database (IMDB): HANA, TimesTen, or SQLFire may be the best alternative for a small cloud-based mart.
David Linthicum has it right in spades for the data warehouse space… we need some user pull-through… and we need cloud computing as the platform to make these user-driven initiatives manageable.
I would like to point you to two articles in the latest Northern California Oracle Users Group (NoCOUG) Journal here.
The first is an interview of Kevin Closson here. The interview is long and will take some time to get through… so set aside 30 minutes… it will be worth it as Kevin discusses Exadata, shared-nothingness, and other topics related to database hardware architecture.
The second article I would like to suggest (by the way there are several other excellent articles) is by Dr. Bert Scalzo. He reminds us that our job as engineers is to build the most cost-effective solution… not to build the perfect solution. He suggests that hardware should be treated as a dynamic resource that can be provisioned easily to solve performance problems.
I have argued that in a shared-nothing, scalable, architecture it is often cheaper to add another $20,000 fat server than to spend $100,000 of staff time to tune around a performance problem. This is especially true when the tuning involves building indexes and materialized views or pre-aggregated tables that make your warehouse fragile and more difficult to tune the next time. See here…
Back to Kevin’s interview and to tie the two articles together… Kevin suggests that as long as data flows into the CPUs fast enough then there is no reason to pick a shared-nothing architecture over a shared-everything architecture. He insists on symmetry and rightfully points out that a shared-everything system can be symmetrical. But it is more difficult to maintain symmetry as you scale up a shared-everything system… and easy scale is what is required to treat hardware as a dynamic resource. So… I remain convinced that shared-nothing is the way to go…
I have promised not to promote HANA heavily on this site… and I will keep that promise. But I want to share something with you about the HANA architecture that is not part of the normal in-memory database (IMDB) marketing message: HANA is parallel from its foundation.
What I mean by that is that when a query is executed in-memory HANA dynamically shards the data in-memory and lets each core start a thread to work on its shard.
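To make the idea concrete, here is a toy sketch of the pattern: split an in-memory column into one shard per core, let one worker per shard compute a partial result, then combine the partials. This is emphatically not HANA's implementation. `make_shards` and `parallel_sum` are hypothetical names, and in CPython the global interpreter lock keeps these threads from truly running CPU work side by side, so read it for the shape of the approach rather than for speed.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def make_shards(values, shard_count):
    """Split values into shard_count contiguous slices of near-equal size."""
    size, extra = divmod(len(values), shard_count)
    shards, start = [], 0
    for i in range(shard_count):
        end = start + size + (1 if i < extra else 0)
        shards.append(values[start:end])
        start = end
    return shards


def parallel_sum(values, workers=None):
    """Aggregate a column with one worker thread per shard."""
    workers = workers or os.cpu_count() or 1
    shards = make_shards(values, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(sum, shards))  # one partial result per shard
    return sum(partials)  # final combine step


column = list(range(1_000_000))  # stand-in for an in-memory column
total = parallel_sum(column)
print(total)
```

The sharding here happens at query time against data that is already in memory, which is the distinction being drawn with products that fix their parallelism by starting a set number of database instances per node.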
Other shared-nothing implementations like Teradata and Greenplum, which are not built on a native parallel architecture, start multiple instances of the database to take advantage of multiple cores. If they can start an instance-per-core then they approximate the parallelism of a native implementation… at the cost of inter-instance communication. Oracle, to my knowledge, does not parallelize steps within a single instance… I could be wrong there so I’ll ask my readers to help?
As you would expect, for analytics and complex queries this architecture provides a distinct advantage. HANA customers are optimizing price models sub-second in-real-time with each quote instead of executing a once-a-week 12-hour modeling job.
As you would expect, HANA cannot yet stretch into the petabyte range. The current HANA sweet spot for warehouses or marts is in the sub-TB to 20TB range.