Cloudera and Teradata have jointly published a nice paper here that presents an interesting perspective of how Hadoop and an EDW play together. Simply put, Hadoop becomes the staging area for “raw data streams” while the EDW stores data from “operational systems”. Hadoop then analyzes the raw data and shares the results with the EDW. Two early examples provided suggest:
- Click stream data is analyzed to identify customer preferences that are then shared with the EDW. Note that the amount of data sent from Hadoop to the EDW would be fairly small in this case.
- Detailed data is stored on Hadoop to build analytic models. The models are then transferred to the EDW to score sales activity data. Note that in this scenario the scored activity detail has to live in Hadoop to perform modeling… but it is unclear why it has to live in the EDW as well. I presume that scoring takes place on the EDW instead of in Hadoop for performance reasons… but maybe the data, the modeling, and the scoring should just take place in Hadoop?
The paper then positions Hadoop as an active archive. I like this idea very much. Hadoop can store archived data that is only accessed once a month or once a quarter or less often… and that data can be processed directly by Hadoop programs or shared with the EDW data using facilities such as Teradata’s SQL-H, or Greenplum‘s External Hadoop tables (not by HAWQ, though… see here), or by other federation engines connected to HANA, SQL Server, Oracle, etc.
But think about the implications on how much data has to stay in your EDW if you archive everything older than 90, or even 180, days to Hadoop. The EDW shrinks significantly and the TCO advantage to your Enterprise will be significant. This is very cool.
There is one item in the paper I disagree with, though… and another statement that I think has a very short shelf-life.
The paper suggests that indexes, materialized views, aggregate join indexes, and other tweaks are what differentiates an EDW. I believe that reliance on these structures make for a fragile EDW where only some queries can run fast. I like Teradata better when it just robustly scans fast and none of these redundant-data tuning artifacts are required (more here and here). Teradata was the original scan-fast DBMS… it is more than capable.
The paper also suggests that an EDW maintains value by including a sophisticated cost-based optimizer that uses data demographic statistics to identify an optimal query execution plan. I agree that Hadoop lacks this now… but there are several projects like Cloudera Impala that will eliminate this gap in the near term.
I believe that if you read between the lines you will see more evidence to support my belief (here) that Hadoop will squeeze the EDW vendors hard… and that the size of a squeezed EDW will then fit in an in-memory database.
If Greenplum HAWQ does not look promising (see my previous posts on HAWQ here and here) what are the prospects for Teradata Aster Data… which aspires to both replace and/or co-exist with Hadoop for a fee? Teradata+Hadoop maybe… but Teradata+Aster+Hadoop seems like one layer too many… as does Aster+Hadoop.
(OK, I removed the bad “HAWQing” pun in the title… no complaints from readers… it just seemed unfair… – Rob)
First, from a technical standpoint I like the Greenplum-on-HDFS HAWQ offering. It looks like the GP Team replaced XFS with HDFS and added some native support for several HDFS file types. I will say more on this soon.
Let me propose an analogy: Hadoop is an eco-system of open source components much like LINUX is an eco-system of open source components. If you think the analogy apt, then HAWQ on HDFS is not Hadoop any more than Microsoft Internet Explorer on LINUX is LINUX. Hive is open source and part of the Hadoop eco-system… as is Impala. Firefox is open source and part of the LINUX eco-system. HAWQ is not Hadoop.
The HortonWorks link points out that Greenplum is not engaged in the Hadoop eco-system as a contributor. They also quote Greenplum as saying that they have 300 developers working on Hadoop. Well… if HAWQ is part of Hadoop and HAWQ is the Greenplum database on HDFS then they have 300 developers on Hadoop. But if, as I suggested, HAWQ is not Hadoop then the number of Greenplum developers on Hadoop might be less. I bumped into a long-time Greenplum employee at Strata who told me that HAWQ was a skunkworks project with 4-5 developers max. This comes from a credible source… but it is still rumor-quality… so take it with a grain of salt.
The bottom line is that Greenplum has marketed very aggressively. They fuzz the definition of Hadoop to claim their commercial database offering running on Hadoop is therefore “Hadoop”. They fuzz the definition of developers working on Hadoop based on this first fuzz.
But does it matter? Greenplum will read and process data stored in HDFS faster than any other SQL-based engine. That is worth something.
But what is it worth? I’m fairly certain that the Greenplum databases will run faster off of Hadoop on XFS than in Hadoop… maybe significantly faster. So the reason for Greenplum on HDFS is faster SQL access to data in HDFS files.
This leads me to my question. I wonder… were the performance numbers quoted, showing a significant performance advantage over both HIVE and Impala, based on queries executed against the Greenplum proprietary table formats or against the same native HDFS file types read by HIVE? If they ran against Greenplum tables then I wonder what the real apples-to-apples comparison would show? Note that I am not being cynical here… I do not know how the tests were set up… only that Greenplum was fast. But if as I said: “ the reason for Greenplum on HDFS is faster SQL access to data in HDFS files” and the data was in Greenplum file structures accessible only by Greenplum then there is little reason left.
I also wonder if it matters because HIVE and Impala will improve their performance significantly over the next 12-24 months. The sheer amount of human R&D being expended here will allow these SQL engines to catch, or nearly catch, HAWQ in performance. If there is any gap left, the price and the community of open source offerings will defeat HAWQ in the market.
As I have suggested here… there is no apparent commercial opportunity competing against Hadoop at this point. I suggested here that Hadoop would eat Greenplum if they stuck to the analytics space and offered both products… effectively competing with themselves. This new strategy is not likely to work in the medium or long run. Greenplum is, indeed, all-in on Hadoop… but without a winning hand.
March 10: See here for the answers to my questions… – Rob
March 12: See here for a rethink on this subject… – Rob
- Greenplum HAWQ (dbms2.com)
- EMC morphs Hadoop elephant into SQL database Hawq (go.theregister.com)
- EMC to Hadoop competition: “See ya, wouldn’t wanna be ya.” (gigaom.com)
If I were the Register I would have titled this: Raging Stuffed Elephant To Devour Two Warehouse Vendors… I love the Register… if you do not read it have a look…
This is a post is about the market implications of architecture…
Let us assume that Hadoop matures and finds a permanent place in the market. This is not certain with some folks expressing concern (here) and others boundless enthusiasm (here). So let’s assume… and consider where it might fit.
One place is in the data warehouse market… This view says Hadoop replaces the DBMS for data warehouses. But the very mature BI/DW market requires a high level of operational integrity and Hadoop is not there yet… it is advancing rapidly as an enterprise platform and I believe it will get there… but it will be 3-4 years. This is the thinking I provided here that leads me to draw the picture in Figure 1.
It is not that I believe that Hadoop will consume the data warehouse market but I believe that very large EDW’s… those over 1PB… and maybe over 500TB will be compelled by the economics of “free” to move big warehouses to Hadoop. So Hadoop will likely move down into the EDW space from the top.
Another option suggests that Big Data will be a platform unto itself. In this view Hadoop will sit beside the existing BI/DW platform and feed that platform the results of queries that derive structure from unstructured data… and/or that aggregate Big Data into consumable chunks. This is where Hadoop sits today.
In data warehouse terms this positions Hadoop as a very large independent analytic data mart. Figure 2 depicts this. Note that an analytics data mart, and a Hadoop cluster, require far less in the way of operational infrastructure… they share very similar technical requirements.
This leads me to the point of this post… if Hadoop becomes a very large analytic data mart then where will Greenplum and Netezza fit in 2-3 years? Both vendors are positioning themselves in the analytic space… Greenplum almost exclusively so. Both vendors offer integrated Hadoop products… Greenplum offers the Greenplum database and Hadoop in the same hardware cluster (see here for their latest announcement)… Netezza provides a Hadoop connector (here). But if you believe in Hadoop… as both vendors ardently do… where do their databases fit in the analytics space once Hadoop matures and fully supports SQL? In the next 3-4 years what will these RDBMSs offer in the big data analytics space that will be compelling enough to make the configuration in Figure 3 attractive?
I know that today Hadoop cannot do all that either Netezza or Greenplum can do. I understand that Netezza has two positions in the market… as an analytic appliance and as a data mart appliance… so it may survive in the mart space. But the overlap of technical requirements between Hadoop and an analytic data mart… combined with the enormous human investment in Hadoop R&D, both in the core and in the eco-system… make me wonder about where “Big Data” analytic relational databases will fit?
Note that this is not a criticism of the Greenplum RDBMS. Greenplum is a very fine product, one of the best EDW platforms around. I’ll have more to say about it when I provide my 2 Cents… But if Figure 2 describes the end state for analytics in 2-3 years then where is the place for the Figure 3 architecture? If Figure 3 is the end state then I do not see where the line will be drawn between the analytic workload that requires Greenplum and that that will run on Hadoop? I barely can see it now… and I cannot see it at all in the near future.
Both EMC Greenplum and IBM seem to strongly believe in Hadoop… they must see the overlap in functionality and feel the market momentum of Hadoop. They must see, better than most, that Hadoop wins this battle.
There seems to be a sort of odd tradition for bloggers to look back at the past year as the New Year starts to unfold. Here is my review of my posts and some presents
Far and away the most viewed post was Exalytics vs. HANA What are they thinking? This simply notes that these two products are not really comparable sharing only the descriptor “in-memory”.
My Favorite Post
I liked this the best… ’nuff said: What is Big Data?
OK, here is my 2nd favorite: A Quick Five Minute Rule Update for In-memory Databases, but you probably need to read the prequel first: The Five Minute Rule and In-memory Databases
These papers and the underlying thinking by smarter folks than I will inform you about the definition of Hot Data from the point of pure IT economics.
The Most Under-rated Post
This is the post I thought was the most important… as it might strongly influence data warehouse platform buying decisions over the next few years… And it might even influence the stocks you pick: The Future of Hadoop and Big Data DBMSs
Some Other Posts to Read
Here are two posts that informed me:
The Five Minute Rule… This will point you to a Wikipedia article that will point you to the whole series of papers.
What Every Programmer Should Know About Memory… This paper goes into gory detail about how memory works inside a processor. It is hardware-centric for you software folks… but provides the basis for understanding why in-memory DBMSs are fast and why Exadata is not an in-memory DBMS.
And some other Good Stuff
Kevin Closson on Exadata
Thank you for your attention last year. I hope that each of you has a safe, prosperous, and happy new year…
I would like to recommend you to Barry Devlin’s post here titled “Big Data is Dead… Long Live All Data”. The post ends with the paragraph:
“All this says to me that big data as a technological category is becoming an increasingly meaningless name. Big data is essentially all data. Is there any chance that the marketing folks can hear me?”
I could not agree more. If “big data” is meaningful then, as I have argued, it must be a new thing associated with several newish sources of data that come in large volumes like social media data or sensor data or log data. But the term is so abused that I no longer believe that it is salvageable. Big Data is all data… it is any data…
So of course it is true that business must prepare for it (here), that cloud computing must support it (here), that it is more than just a technology issue (here), that organizations need to be aligned (here)… and so on (note that these are just the four most recent tweets on my feed… I could go on and on). How can this be the driver of new IT spending? How can it be the driver of anything?
The point is that everything that has ever been said about data and data warehousing is being restated as new thinking related to big data. If we measured the information entropy we would find no new information is present.
Big Data is Big Hype… Fuel for Bloggers and Pundits…
Related articles (Yawn)…
- Big Data To Become Powerful Driver Of IT Spending – Gartner (misco.co.uk)
- Big Data News of the Week: BYOD (forbes.com)
- Bigger Big Data, Big Thoughts, and Big Ideas (ibm.com)
- Enterprises Are Spending Wildly On ‘Big Data’ But Don’t Know If It’s Worth It Yet (isykes.wordpress.com)
- Big data: a retailer’s guide to likes, tweets, reviews, customer data, and basically everything else (infographic) (venturebeat.com)
- Agile Cloud, Big Data and Mobility (sys-con.com)
- Big Data – You Can Start Small (isykes.wordpress.com)
- Big data: a retailer’s guide to likes, tweets, reviews, customer data, and basically everything else (infographic) (c24.co.uk)
- Big Data and Big Decisions: Learning from the 2012 Election (zenya.com)
- Big Thinkers on Big Data [Video] (liveramp.com)
First, you should look at Google’s Spanner paper here… this is the next-gen from Google and once it is embraced by the open source community it will put even more pressure on the big data DBMSs. Also have a look at YARN the next Map/Reduce… more pressure still…
Next… you can imagine that the conventional database folks will quibble a little with my analysis. Lets try to anticipate the push-back:
- Hadoop will never be as fast as a commercial DBMS
Maybe not… but if it is close then a little more hardware will make up the difference… and “free” is hard to beat in price/performance.
I do not think so… disk controllers, the overhead of non-memory I/O, and an inability to fully optimize processing for in-memory will make a big difference. I said 50X to be conservative… but it could be 200X… and a 200X performance improvement reduces the memory required to process a query by 200X… so it adds up.
- The Price of IMDB will always be prohibitive
Nope. The same memory that is in SSD’s will become available as primary memory soon and the price points for SSD-based and IMDB will converge.
- IMDB won’t scale to 100TB
HANA is already there… others will follow.
- Commercial customers will never give up their databases for open source
Economics means that you pay me now or you pay me later… companies will do what makes economic sense.
The original post on this is here…
About four years ago Michael McIntire and I were pondering the rise of Hadoop. This blog will share bits of that conversation, provide an update based on the state of Hadoop today, and suggest a future state…
Briefly… we believed that the Hadoop eco-system was building all of the piece-parts of a very large database management system. You could see the basics: a distributed file system in HDFS, a low-level query engine in Map/Reduce with an abstraction in Pig, and the beginnings of optimization, SQL, availability, backup & recovery, etc.
We wondered why this process was underway… why would enterprises go to Hadoop when there were perfectly good relational VLDBs that could solve most of the problems… and where they could not… extending a mature RDBMS would be easier than the giant start-from-scratch of Hadoop.
We saw two reasons for the Hadoop project, process, and progress:
- Michael pointed out that the RDBMS vendors just did not understand how to price their products on “Big Data” (to be fair that term was not in use then)… if you have 7PB of data, as Michael did… then at the current $35K/TB list price the bill would be $245M. Even if you discounted to $1K/TB the tab would be $7M. The DBMS vendors were giving the big guys a financial incentive to Build instead of Buy… and so Google Built and Yahoo Built and Hadoop emerged.
- I pointed out that the academic community would support this… the ability to write a thesis based on new work in the DBMS space was becoming harder… but it was possible to sponsor papers that applied DBMS concepts to Hadoop and keep the PhD pipeline filled.
So there was funding, research, and development.
The narrative from here on is my own… Michael is off the hook..
Today Hadoop has a first release in the public domain. Dozens of companies are working to extend the core… some as contributors, some with a commercial interest, many with both incentives. The stack is maturing… and we now easily imagine a day when Hadoop will rival Teradata, Exadata, Netezza, and Greenplum in VLDB performance… with some product maturity and a rich set of features. And if Hadoop gets close and the price is free (or nearly so…) then the price/performance of Hadoop will make it unbeatable for “Big Data”.
In fact, the trigger for writing this piece now was the news a few weeks back from one of our Hadoop partners that HIVE was POC’d against one of the databases mentioned above on a big data problem and came close. The main query ran in 35 minutes on the DBMS and in 45 minutes with HIVE. The end is in sight… and sooner than expected.
What might this mean for the future?
Imagine a market where Hadoop can solve for big data problems… let’s say problems over 500TB just to draw a line… with the same performance as the best RDBMS, in a write-once/read-many use case like a data warehouse… for free. For FREE… plus the cost of the hardware. Hadoop wins… no contest.
Let’s suggest a market from 50TB to 500TB where a conventional RDBMS can out-perform Hadoop by 2X more-or-less… but Hadoop is free… so only applications where the performance matters can pay the price premium.
And let’s suggest a high performance in-memory database (IMDB) market that beats disk-based and SSD-based RDBMS by 50X for a 50% premium (based on new technologies like phase-change memory see here…) and can beat Hadoop by 1000X but at a higher cost.
You can see the squeeze:
- IMDB will own the high performance market… most-likely in the 100TB and under space…
- Hadoop will own the big data 500TB+ low-cost market…
- and the conventional DBMS vendors will fight it out for adequate-performance/medium-priced applications from 100TB to 500TB… with continued pressure from the top and the bottom.
Economics will drive this. The conventional DBMS vendors are moving to SSD’s… which increases their price in the direction of an IMDB… and increases their price/performance in the same good direction. But the same memory in SSD’s will soon be generally available as primary memory. So the IMDB prices and the conventional DBMS prices will converge… but the IMDB products will retain a 50X-100X performance advantage by managing the new memory as memory instead of as a peripheral device. Hadoop may or may not leverage SSD’s… but it will be free.
This week an “industry leader” stood in front of a large IT conference and stated that “big data” is any data volume or data complexity that puts you out of your comfort zone. This is not helpful. It makes the definition of big data subjective and psychological. I can see the cartoon now:
Dilbert: I just loaded some new data…
Freud: How does that make you feel, Dilbert?
Industry leaders are trying to get companies to come to grips with the software, hardware, and staffing/expertise issues related to a new opportunity. The operative word is “new”.
Here is the Google Trend for the term “Big Data“:
Big Data is new… it is NOT any data that makes you feel queasy. People have been uncomfortable with data since computing began.
Big data is about the collection, storage, and analysis of the detailed data that new technology is generating.
The problem is that everyone wants to use the phrase to expound whatever thing they have to sell: every product by every vendor supports big data… and every “industry leader” with every talk needs to include the phrase in the title of their talk and repeat it as many times as possible. So every data warehouse pitch is rehashed as a big data pitch, every data governance, master data management, OLAP, data mining, everything is now big data.
Let’s stop. I Big Data for one, Big Data refuse to Big Data pander to the Big Data boost one gets from Big Data using the phrase to get Big Data attention.
I almost forgot… here is my best previous post on the topic…
There is still an open question over whether, after the Big Bang, there is enough mass in the Universe to slow the expansion and cause the universe to contract. While the Big Data Bang continues to expand the universe of bits and bytes… I would like to ask whether some of these numbers are overstated? I know that the sum of the bits and bytes is expanding but I wonder if the universe of information is expanding as much as we claim?
Note that by “information” I mean a unique combination of bits and bytes representing some new information. In other words, if the same information is copied redundantly over and over does that count?
There is a significant growth industry in deduplication software that can backup data without copying redundant information. The savings from these products is astounding. NetApp claims 70% of the unstructured data may be redundant (see here). Data Domain says that eliminating (and compressing) redundant data reduces storage requirements by 10X-30X (see here). What’s up with that?
In the data warehouse space it is just as bad. The same data lives in OLTP systems, ETL staging areas, Operational Data Stores, Enterprise Data Warehouses, Data Marts, and now Hadoop clusters. The same information is replicated in aggregate tables, indexes, materialized views, and cubes. If you go into many shops you can find 50TB of EDW data exploded into 500TB of sandboxes for the data scientists to play with. Data is stored in snapshots on an hourly basis where less than 10% of the data changes from hour to hour. There is redundancy everywhere. There is redundancy everywhere.
I believe that there is a data explosion… and I believe that it is significant… but there is also a sort of laziness about copying data.
Soon we will see in production the first systems where a single copy of OLTP and EDW and analytic data can reside in the same platform and be shared. It will be sort of shocking to see the Big Data Bang slow a little…