I renamed this so that Teradata folks would not get here so often… its not really about Intelligent Memory… just prompted by it. The post on Intelligent Memory is here. – Rob
Two quick comments on Teradata’s recent announcement of Intelligent Memory.
First… very very cool. More on this to come.
Next… life is going to become very hard for my readers and for bloggers in this space. The notion of an in-memory database is becoming rightfully blurred… as is the notion of column store.
Oracle blurs the concepts with words like “database in-memory” and “hybrid column compression” which is neither an in-memory database or a column store.
Teradata blurs the concept with a strong offering that uses DRAM as a block-IO device (like the old RAM-disks we used to configure on our PCs).
Teradata and Greenplum blur the idea of a column store by adding columnar tables over their row store database engines.
I’m not a fan of the double-speak… but the ability of companies to apply the 80/20 rule to stretch their architectures and glue on new advanced technologies is a good thing for consumers.
But it becomes very hard to distinguish the products now.
In future blogs I’ll try to point out differences… but we’ll have to go a little deeper into the Database Fog.
Cloudera and Teradata have jointly published a nice paper here that presents an interesting perspective of how Hadoop and an EDW play together. Simply put, Hadoop becomes the staging area for “raw data streams” while the EDW stores data from “operational systems”. Hadoop then analyzes the raw data and shares the results with the EDW. Two early examples provided suggest:
- Click stream data is analyzed to identify customer preferences that are then shared with the EDW. Note that the amount of data sent from Hadoop to the EDW would be fairly small in this case.
- Detailed data is stored on Hadoop to build analytic models. The models are then transferred to the EDW to score sales activity data. Note that in this scenario the scored activity detail has to live in Hadoop to perform modeling… but it is unclear why it has to live in the EDW as well. I presume that scoring takes place on the EDW instead of in Hadoop for performance reasons… but maybe the data, the modeling, and the scoring should just take place in Hadoop?
The paper then positions Hadoop as an active archive. I like this idea very much. Hadoop can store archived data that is only accessed once a month or once a quarter or less often… and that data can be processed directly by Hadoop programs or shared with the EDW data using facilities such as Teradata’s SQL-H, or Greenplum‘s External Hadoop tables (not by HAWQ, though… see here), or by other federation engines connected to HANA, SQL Server, Oracle, etc.
But think about the implications on how much data has to stay in your EDW if you archive everything older than 90, or even 180, days to Hadoop. The EDW shrinks significantly and the TCO advantage to your Enterprise will be significant. This is very cool.
There is one item in the paper I disagree with, though… and another statement that I think has a very short shelf-life.
The paper suggests that indexes, materialized views, aggregate join indexes, and other tweaks are what differentiates an EDW. I believe that reliance on these structures make for a fragile EDW where only some queries can run fast. I like Teradata better when it just robustly scans fast and none of these redundant-data tuning artifacts are required (more here and here). Teradata was the original scan-fast DBMS… it is more than capable.
The paper also suggests that an EDW maintains value by including a sophisticated cost-based optimizer that uses data demographic statistics to identify an optimal query execution plan. I agree that Hadoop lacks this now… but there are several projects like Cloudera Impala that will eliminate this gap in the near term.
I believe that if you read between the lines you will see more evidence to support my belief (here) that Hadoop will squeeze the EDW vendors hard… and that the size of a squeezed EDW will then fit in an in-memory database.
With the recent announcements of DB2 BLU and column store I suspect that DB2 will outperform Netezza when the query mix does not fall directly in Netezza’s sweet spot.
I also have a suspicion that the Netezza architecture, with its execution engine split across two different processors, is just hard to engineer. I cannot think of another reason features come so slowly there. Why, for example, is there no columnar support? Greenplum built it on the same Postgres base with less than a handful of engineers in a year. Teradata now offers columnar tables as well.
These concerns… combined with some previous notes on Netezza add up as follows:
- FPGAs no longer provide a performance advantage (per my link above)
- FPGAs limit the ability of the DBMS to use more cores (see here)
- FPGAs limit the ability of the DBMS to manage workload (see here… and especially the comments)
- FPGAs and having a 2-phase split execution environment limits the ability to extend and enhance the code base (a new conjecture)
- Zone Maps and CBTs provide a limited ability to solve for a wide range of queries… they are just an index (see here)
- DB2 Column Store provides a performance boost equal to or greater than zone maps and CBTs (a new conjecture)
- DB2 BLU provides a performance boost well in excess of what Netezza can provide (see here)
The Netezza architecture with FPGAs provided a distinct advantage in 2000 when CPU was the scarce commodity. But multi-core systems and the advance of Moore’s Law soon made processing abundant… and the advantage of FPGA co-processing diminished. Without a distinct advantage the split execution architecture became a disadvantage… and the complexity of that design kept Netezza from developing the advances on top of the Postgres base that were very easy to develop by others.
Architecture counts… and DB2 is a strong product. If, as I suspect, DB2 is now a more capable product than Netezza… I wonder what path IBM may take?
In the post here I listed the units of parallelism (UoP) applied by various products on a single node. Those findings are summarized in the table below.
Cores per Node
UoP per Node
|Greenplum||DCA UAP Edition||
|Recommends 1 Segment for each 2 cores. Maybe some multi-threading per query so it could be greater than 8 on the average… and could be 16 with hyper-threads… but not more than 32 for sure.|
|Maybe only 12… cannot find if they use hyper-threads.|
|May use hyper-threads but limited by 16 FPGAs.|
|HANA||Any Xeon E7-4800||
A UoP is defined as the maximum number of instructions that can execute in parallel on a single node for a single query. Note that in the comments there was a lively debate where some readers wanted to count threads or processes or slices that were “active” but in a wait state. Since any program can start threads that wait I do not count these as UoP (later we might devise a new measure named units of waiting that would gauge the inefficiency in any given design by measuring the amount of waiting around required to keep the CPUs fed… maybe the measure would be valuable in measuring the inefficiency of the queue at your doctor’s office or at any government agency).
On some CPUs vendors such as Intel allow two threads to execute instructions in-parallel in a core. This is called hyper-threading and, if implemented, it allows for two UoP on a single core. Rather than constantly qualify the statements for the rest of this blog when I refer to cores I mean to imply hyper-threads.
The lively comments in the blog included some discussion of the sort of techniques used by vendors to try and keep the cores in the CPU on each node fed. It is these techniques that lead to more active I/O streams than cores and more threads than cores.
For several years now Intel and the other CPU manufacturers have been building ever more cores into their products. This has allowed them to continue the trend known as Moore’s Law. Multi-core is now a fact of life and even phones, tablets, and personal computers have multi-core chips.
But if you look at the table you can see that the database products above, even the newly announced products from Teradata and Netezza, are using CPUs with relatively few cores. The high-end Intel processors have 40 cores and the databases, with the exception of HANA, use Intel products with at most 16 cores. Further, Intel will deliver Ivy Bridge processors to the market this year with 120 cores. These vendors know this… yet they have chosen to deliver appliances with the previous generation CPUs. You might ask why?
I believe that there is an architectural reason for this (also a marketing reason covered here).
It is very hard to keep 80 cores fed with data when you have to perform block I/O. It will be nearly impossible to keep the 240 cores coming with Ivy Bridge fed. One solution is to deploy more nodes in a shared-nothing configuration with fewer cores per node… but this will be expensive requiring more power, floorspace, administration, etc. This is the solution taken by most of the vendors above. Another solution is to solve the problem without I/O with an in-memory database (IMDB) architecture. This is the solution taken by SAP with HANA.
Intel, IBM, and the rest will continue to build out using the multi-core approach for the foreseeable future. IMDB products will be able to fully utilize this product. Other products will struggle to take full advantage as we can see already… they will adapt and adjust and do what they can… but ultimately IMDB will win, I think… because there is just no other way to keep up as Moore’s Law continues to drive technology… no other way to feed the CPU engines with data fast enough.
If I am right then you will see more IMDB offerings from more vendors, including from the major vendors in the near future (note that this does not include the announcements of “database in memory” from Oracle which is not by any measure an in-memory database).
This is the underlying reason why Donald Feinberg (and Timo Elliott) are right on here. Every organization will be running in-memory… and soon.
6 May… There is a summary of this post and on the comments here. - Rob
17 April… A single unit of parallelism is a core plus a thread/process to feed it instructions plus a feed of data. The only exception is when the core uses hyper-threading… in which case 2 instructions can execute more-or-less at the same time… then a core provides 2 units of parallelism. All of the other stuff: many threads per core and many data shards/slices per thread are just techniques to keep the core fed. – Rob
16 April… I edited this to correct my loose use of the word “shard”. A shard is a physical slice of data and I was using it to represent a unit of parallelism. – Rob
I made the observation in this post that there is some inefficiency in an architecture that builds parallel streams that communicate on a single node across operating system boundaries… and these inefficiencies can limit the number of parallel streams that can be deployed. Greenplum, for example, no longer recommends deploying a segment instance per core on a single node and as a result not all of the available CPU can be applied to each query.
This blog will outline some other interesting limits on the level of parallelism in several products and on the definition of Massively Parallel Processing (MPP). Note that the level of parallelism is directly associated with performance.
Exadata deploys 12 cores per cell/node in the storage subsystem. They deploy 12 disk drives per node. I cannot see it clearly documented how many threads they deploy per disk… but it could not be more than 24 units of parallelism if they use hyper-threading of some sort. It may well be that there are only 12 units of parallelism per node (see here).
Updated April 16: Netezza deploys 8 “slices” per S-Blade… 8 units of parallelism… one for each FPGA core in the Twin times four (2X4) Twinfin architecture (see here). The next generation Netezza Striper will have 16-way parallelism per node with 16 Intel cores and 16 FPGA cores…
Updated April 17: Teradata uses hyper-threading (see here)… so that they will deploy 24 units of parallelism per node on an EDW 6700C (2X6X2) and 32 units of parallelism per node on an EDW 6700H (2X8X2).
You can see the different definitions of the word “massive” in these various parallel processing systems.
Note that the next generation of Xeon processors coming out later this year will have 8X15 processors or 120 cores on a fat node:
- This will provide HANA with the ability to deploy 240 units of parallelism per node.
- Netezza will have to find a way to scale up the FPGA cores per S-Blade to keep up. TwinFin will have to become QuadFin or DozenFin. It became HexadecaFin… see above. – Rob
- Exadata will have to put 120 SSD/disk drive combos in each node instead of 12 if they want to maintain the same parallelism-to-disk ratio with 120 units of parallelism.
- Teradata will have to find a way to get more I/O bandwidth on the problem if they want to deploy nodes with 120+ units of parallelism per node.
Most likely all but HANA will deploy more nodes with a smaller number of cores and pay the price of more servers, more power, more floor space, and inefficient inter-node network communications.
So stay tuned…
The following performance numbers are being reported publicly for HANA:
- HANA scans data at 3MB/msec/core
- On a high-end 80-core server this translates to 240GB/sec per node
- HANA inserts rows at 1.5M records/sec/core
- Or 120M records/sec per node…
- Aggregates 12M records/sec/core
- Or 960M records per node…
These numbers seem reasonable:
- A 100X improvement over disk-based scan (The recent EMC DCA announcement claimed 2.4GB/sec per node for Greenplum)…
- Sort of standard OLTP insert speeds for a big server…
- Huge performance gains for in-memory aggregation using columnar orientation and SIMD HPC instructions…
Note that these numbers are the basis for suggesting that there is a new low-TCO approach to BI that eliminates aggregate tables, materialized views, cubes, and indexes… and eliminates the operational overhead of computing these artifacts… and still provides a sub-second response for all queries.
I want to soften my criticism of Greenplum‘s announcement of HAWQ a little. This post by Merv Adrian convinced me that part of by blog here looked at the issue of whether HAWQ is Hadoop too simply. I could outline a long chain of logic that shows the difficulty in making a rule for what is Hadoop and what is not (simply: MapR is Hadoop and commercial… Hadapt is Hadoop and uses a non-standard file format… so what is the rule?). But it is not really important… and I did not help my readers by getting sucked into the debate. It is not important whether Greenplum is Hadoop or not… whether they have committers or not. They are surely in the game and when other companies start treating them as competitors by calling them out (here) it proves that this is so.
It is not important, really, whether they have 5 developers or 300 on “Hadoop”. They may have been over-zealous in marketing this… but they were trying to impress us all with their commitment to Hadoop… and they succeeded… we should not doubt that they are “all-in”.
This leaves my concern discussed here over the technical sense in deploying Greenplum on HDFS as HAWQ… or deploying Greenplum in native mode with the UAP Hadoop integration features which include all of the same functionality as HAWQ… and 2x-3X better performance.
It leaves my concern that their open source competition more-or-less matches them in performance when queries are run against non-proprietary, native Hadoop, data structures… and my concerns that the community will match their performance very soon in every respect.
It is worth highlighting the value of HAWQ’s very nearly complete support for the SQL standard against native Hadoop data structures. This differentiates them. Building out the SQL dialect is not a hard technical problem these days. I predict that there will be very nearly complete support for SQL in an open source offering in the next 18-24 months.
These technical issues leave me concerned with the viability of Greenplum in the market. But there are two ways to look at the EMC Pivotal Initiative: it could be a cloud play… in which case Greenplum will be an uncomfortable fit; or it could be an open source play… in which case, here comes the wacky idea, Greenplum could be open-sourced along side Cloud Foundry and then this whole issue on committers and Hadoopiness becomes moot. Greenplum is, after all, Postgres under the covers.
If Greenplum HAWQ does not look promising (see my previous posts on HAWQ here and here) what are the prospects for Teradata Aster Data… which aspires to both replace and/or co-exist with Hadoop for a fee? Teradata+Hadoop maybe… but Teradata+Aster+Hadoop seems like one layer too many… as does Aster+Hadoop.
(OK, I removed the bad “HAWQing” pun in the title… no complaints from readers… it just seemed unfair… – Rob)
My contacts from Strata read my post here and provided me with the following information:
- The performance numbers quoted for Greenplum HAWQ versus HIVE and Impala used Greenplum tables implemented over HDFS. In other words, this data is unreadable from outside of the Greenplum database… unreadable by any other program in the Hadoop eco-system… a proprietary format. If the tests were re-run using the same open data structures used by HIVE and Impala you would find the performance of HAWQ to be closer to, or worse than, those Hadoop components.
- The HAWQ performance numbers quoted represent a 2X-3X performance degradation over the same benchmark run on the native Greenplum RDBMS.
Again… this is from a credible source… but please consider this a rumor… and view this report, and the associated Greenplum marketing… with an appropriate measure of engineering skepticism.
Greenplum is a fantastic product… if I assume the report to be true then I do not understand why are they doing this… what use case is solved by a 300% performance degradation accessing proprietary data in HDFS? Remember, you could put Greenplum in the same cluster as Hadoop (UAP) and query everything HAWQ could query without the performance degradation. I just do not see the point? Could someone from GP comment and help my readers and myself here?
- EMC morphs Hadoop elephant into SQL database Hawq (go.theregister.com)
- EMC touts screeching Hawq SQL performance for Hadoop (go.theregister.com)
- EMC launches Hadoop distribution, takes aim at Cloudera (zdnet.com)
- Greenplum HAWQ (dbms2.com)
Since my blogs tend to be in response to some stimulus they may not reflect a holistic view on any particular product. The “My 2 Cents” series will try to provide a broader view…
Please consider this as you read on…
From a technical perspective, Greenplum is my favorite data warehouse database. Built on the same architecture as Teradata (see here), the Greenplum team was able to extend the core of Postgres… first building out a shared-nothing architecture and then adding feature after feature… putting the heat on the other major players. Greenplum was the first row-based RDBMS to add full columnar support… and their data-loading capability is second-to-none.
Oddly they do not want to be in the data warehouse space. Their recent announcement (here) does not include any reference to data warehousing or business intelligence. The tweets from @Greenplum, the Greenplum website, and all things marketing are focussed on analytics and/or Hadoop. Even their page on data warehousing (here) has no articles on data warehousing. It is just not their target market. That is fine… the product is still a great EDW platform… but it is a worry.
Where They Win
The reason they target analytics is because they excel there. If your warehouse workload clogs because of big, complex, queries… Greenplum can win the day. Their data flow architecture, which keeps tuples moving from execution step to execution step without writing to spool provides them with the ability to beat the competition on analytics. They provide a very rich set of in-database analytics and some add-on capabilities to improve the productivity of your data scientist team.
Their data load architecture, which they call scatter-gather, is a big differentiator. If your problem is that you cannot get data loaded and reports out in your nightly batch window then the combination of scatter-gather and the ability to run big report queries is unbeatable.
Greenplum also has a unique solution for near-real-time. They marry Gemfire, an in-memory object-oriented database, with scatter-gather to move small batches of inserted data to Greenplum with a very small time delta. I do not believe this solution supports inserts or deletes as they have to be applied directly to the Greenplum database… but it is a nice capability for a certain class of problems.
Where They Lose
Greenplum, like Teradata, can be beat when the problem to be solved is narrow. In these cases, when the database supports a single application with a small number of queries or when it supports a narrowly focussed data mart, they are vulnerable to Netezza, Vertica, or even Exadata. It is also sometimes the case that a poorly designed POC can narrow the scope enough that Greenplum loses.
Greenplum can also lose when a full EDW is required. The basic architecture of the RDBMS is capable of supporting an EDW… but some of the operational features required… RASR, workload, incremental backup, etc. are not mature. This may well be the intentional result of their focus away from these features at analytics.
In the Market
Despite the worries Greenplum should be included in every POC. They will push Teradata hard in performance and in price/performance.
As noted here… I do not understand their market strategy. It seems that they are competing with themselves by offering Hadoop for analytics… but this cannot be a bad thing for customers even if it is an odd position in the market. The analytics market they favor is tough… relatively small (compared to the DW space)… emerging… there are several capable competitors… and the market is haunted by the same problem that killed the data mining market in the mid-1990′s… there are just not enough skilled data scientists (see here).
My Guess at the Future
I cannot guess at the future of Greenplum… They are being moved into a new business unit that could be spun into a new company that has a charter to build software for the cloud (see here). This is odd in several dimensions. First, as I noted here, the shared nothing architecture Greenplum is built on is not a perfect fit for the cloud. There are ways to get around this (maybe the topic for a future post?) but it will require development in a fundamentally new direction. Further, the new division seems to be a software-only venture. This makes the future of the EMC Greenplum Data Computing Appliance uncertain. I suppose that there will be announcements soon to clarify these questions… but the architectural disconnects make it likely that there will be some arm-waving for a while.
Next up… my 2 Cents on The Rest…