The Future of Hadoop and of Big Data DBMSs
About four years ago Michael McIntire and I were pondering the rise of Hadoop. This blog will share bits of that conversation, provide an update based on the state of Hadoop today, and suggest a future state…
Briefly… we believed that the Hadoop eco-system was building all of the piece-parts of a very large database management system. You could see the basics: a distributed file system in HDFS, a low-level query engine in Map/Reduce with an abstraction in Pig, and the beginnings of optimization, SQL, availability, backup & recovery, etc.
We wondered why this process was underway… why would enterprises go to Hadoop when there were perfectly good relational VLDBs that could solve most of the problems… and where they could not… extending a mature RDBMS would be easier than the giant start-from-scratch of Hadoop.
We saw two reasons for the Hadoop project, process, and progress:
- Michael pointed out that the RDBMS vendors just did not understand how to price their products on “Big Data” (to be fair that term was not in use then)… if you have 7PB of data, as Michael did… then at the current $35K/TB list price the bill would be $245M. Even if you discounted to $1K/TB the tab would be $7M. The DBMS vendors were giving the big guys a financial incentive to Build instead of Buy… and so Google Built and Yahoo Built and Hadoop emerged.
- I pointed out that the academic community would support this… the ability to write a thesis based on new work in the DBMS space was becoming harder… but it was possible to sponsor papers that applied DBMS concepts to Hadoop and keep the PhD pipeline filled.
So there was funding, research, and development.
The narrative from here on is my own… Michael is off the hook..
Today Hadoop has a first release in the public domain. Dozens of companies are working to extend the core… some as contributors, some with a commercial interest, many with both incentives. The stack is maturing… and we now easily imagine a day when Hadoop will rival Teradata, Exadata, Netezza, and Greenplum in VLDB performance… with some product maturity and a rich set of features. And if Hadoop gets close and the price is free (or nearly so…) then the price/performance of Hadoop will make it unbeatable for “Big Data”.
In fact, the trigger for writing this piece now was the news a few weeks back from one of our Hadoop partners that HIVE was POC’d against one of the databases mentioned above on a big data problem and came close. The main query ran in 35 minutes on the DBMS and in 45 minutes with HIVE. The end is in sight… and sooner than expected.
What might this mean for the future?
Imagine a market where Hadoop can solve for big data problems… let’s say problems over 500TB just to draw a line… with the same performance as the best RDBMS, in a write-once/read-many use case like a data warehouse… for free. For FREE… plus the cost of the hardware. Hadoop wins… no contest.
Let’s suggest a market from 50TB to 500TB where a conventional RDBMS can out-perform Hadoop by 2X more-or-less… but Hadoop is free… so only applications where the performance matters can pay the price premium.
And let’s suggest a high performance in-memory database (IMDB) market that beats disk-based and SSD-based RDBMS by 50X for a 50% premium (based on new technologies like phase-change memory see here…) and can beat Hadoop by 1000X but at a higher cost.
You can see the squeeze:
- IMDB will own the high performance market… most-likely in the 100TB and under space…
- Hadoop will own the big data 500TB+ low-cost market…
- and the conventional DBMS vendors will fight it out for adequate-performance/medium-priced applications from 100TB to 500TB… with continued pressure from the top and the bottom.
Economics will drive this. The conventional DBMS vendors are moving to SSD’s… which increases their price in the direction of an IMDB… and increases their price/performance in the same good direction. But the same memory in SSD’s will soon be generally available as primary memory. So the IMDB prices and the conventional DBMS prices will converge… but the IMDB products will retain a 50X-100X performance advantage by managing the new memory as memory instead of as a peripheral device. Hadoop may or may not leverage SSD’s… but it will be free.