My 2 Cents: Greenplum 1Q2013
Since my blogs tend to be in response to some stimulus they may not reflect a holistic view on any particular product. The “My 2 Cents” series will try to provide a broader view…
Please consider this as you read on…
From a technical perspective, Greenplum is my favorite data warehouse database. Built on the same architecture as Teradata (see here), the Greenplum team was able to extend the core of Postgres… first building out a shared-nothing architecture and then adding feature after feature… putting the heat on the other major players. Greenplum was the first row-based RDBMS to add full columnar support… and their data-loading capability is second-to-none.
Oddly they do not want to be in the data warehouse space. Their recent announcement (here) does not include any reference to data warehousing or business intelligence. The tweets from @Greenplum, the Greenplum website, and all things marketing are focussed on analytics and/or Hadoop. Even their page on data warehousing (here) has no articles on data warehousing. It is just not their target market. That is fine… the product is still a great EDW platform… but it is a worry.
Where They Win
The reason they target analytics is because they excel there. If your warehouse workload clogs because of big, complex, queries… Greenplum can win the day. Their data flow architecture, which keeps tuples moving from execution step to execution step without writing to spool provides them with the ability to beat the competition on analytics. They provide a very rich set of in-database analytics and some add-on capabilities to improve the productivity of your data scientist team.
Their data load architecture, which they call scatter-gather, is a big differentiator. If your problem is that you cannot get data loaded and reports out in your nightly batch window then the combination of scatter-gather and the ability to run big report queries is unbeatable.
Greenplum also has a unique solution for near-real-time. They marry Gemfire, an in-memory object-oriented database, with scatter-gather to move small batches of inserted data to Greenplum with a very small time delta. I do not believe this solution supports inserts or deletes as they have to be applied directly to the Greenplum database… but it is a nice capability for a certain class of problems.
Where They Lose
Greenplum, like Teradata, can be beat when the problem to be solved is narrow. In these cases, when the database supports a single application with a small number of queries or when it supports a narrowly focussed data mart, they are vulnerable to Netezza, Vertica, or even Exadata. It is also sometimes the case that a poorly designed POC can narrow the scope enough that Greenplum loses.
Greenplum can also lose when a full EDW is required. The basic architecture of the RDBMS is capable of supporting an EDW… but some of the operational features required… RASR, workload, incremental backup, etc. are not mature. This may well be the intentional result of their focus away from these features at analytics.
In the Market
Despite the worries Greenplum should be included in every POC. They will push Teradata hard in performance and in price/performance.
As noted here… I do not understand their market strategy. It seems that they are competing with themselves by offering Hadoop for analytics… but this cannot be a bad thing for customers even if it is an odd position in the market. The analytics market they favor is tough… relatively small (compared to the DW space)… emerging… there are several capable competitors… and the market is haunted by the same problem that killed the data mining market in the mid-1990′s… there are just not enough skilled data scientists (see here).
My Guess at the Future
I cannot guess at the future of Greenplum… They are being moved into a new business unit that could be spun into a new company that has a charter to build software for the cloud (see here). This is odd in several dimensions. First, as I noted here, the shared nothing architecture Greenplum is built on is not a perfect fit for the cloud. There are ways to get around this (maybe the topic for a future post?) but it will require development in a fundamentally new direction. Further, the new division seems to be a software-only venture. This makes the future of the EMC Greenplum Data Computing Appliance uncertain. I suppose that there will be announcements soon to clarify these questions… but the architectural disconnects make it likely that there will be some arm-waving for a while.
Next up… my 2 Cents on The Rest…