First, from a technical standpoint I like the Greenplum-on-HDFS HAWQ offering. It looks like the GP Team replaced XFS with HDFS and added some native support for several HDFS file types. I will say more on this soon.
Let me propose an analogy: Hadoop is an eco-system of open source components much like LINUX is an eco-system of open source components. If you think the analogy apt, then HAWQ on HDFS is not Hadoop any more than Microsoft Internet Explorer on LINUX is LINUX. Hive is open source and part of the Hadoop eco-system… as is Impala. Firefox is open source and part of the LINUX eco-system. HAWQ is not Hadoop.
The HortonWorks link points out that Greenplum is not engaged in the Hadoop eco-system as a contributor. They also quote Greenplum as saying that they have 300 developers working on Hadoop. Well… if HAWQ is part of Hadoop and HAWQ is the Greenplum database on HDFS then they have 300 developers on Hadoop. But if, as I suggested, HAWQ is not Hadoop then the number of Greenplum developers on Hadoop might be less. I bumped into a long-time Greenplum employee at Strata who told me that HAWQ was a skunkworks project with 4-5 developers max. This comes from a credible source… but it is still rumor-quality… so take it with a grain of salt.
The bottom line is that Greenplum has marketed very aggressively. They fuzz the definition of Hadoop to claim their commercial database offering running on Hadoop is therefore “Hadoop”. They fuzz the definition of developers working on Hadoop based on this first fuzz.
But does it matter? Greenplum will read and process data stored in HDFS faster than any other SQL-based engine. That is worth something.
But what is it worth? I’m fairly certain that the Greenplum databases will run faster off of Hadoop on XFS than in Hadoop… maybe significantly faster. So the reason for Greenplum on HDFS is faster SQL access to data in HDFS files.
This leads me to my question. I wonder… were the performance numbers quoted, showing a significant performance advantage over both HIVE and Impala, based on queries executed against the Greenplum proprietary table formats or against the same native HDFS file types read by HIVE? If they ran against Greenplum tables then I wonder what the real apples-to-apples comparison would show? Note that I am not being cynical here… I do not know how the tests were set up… only that Greenplum was fast. But if as I said: “ the reason for Greenplum on HDFS is faster SQL access to data in HDFS files” and the data was in Greenplum file structures accessible only by Greenplum then there is little reason left.
I also wonder if it matters because HIVE and Impala will improve their performance significantly over the next 12-24 months. The sheer amount of human R&D being expended here will allow these SQL engines to catch, or nearly catch, HAWQ in performance. If there is any gap left, the price and the community of open source offerings will defeat HAWQ in the market.
As I have suggested here… there is no apparent commercial opportunity competing against Hadoop at this point. I suggested here that Hadoop would eat Greenplum if they stuck to the analytics space and offered both products… effectively competing with themselves. This new strategy is not likely to work in the medium or long run. Greenplum is, indeed, all-in on Hadoop… but without a winning hand.
March 10: See here for the answers to my questions… – Rob
March 12: See here for a rethink on this subject… – Rob
- Greenplum HAWQ (dbms2.com)
- EMC morphs Hadoop elephant into SQL database Hawq (go.theregister.com)
- EMC to Hadoop competition: “See ya, wouldn’t wanna be ya.” (gigaom.com)