David Linthicum suggests here that Shadow IT is not all a bad thing. He references a PricewaterhouseCoopers study that suggests that 30% of all IT spending comes from the business directly… from outside of the IT budget.
In the data warehouse space we can confirm these numbers easily. Just google on “data mart consolidation” to see the impact of the business building their own BI infrastructure in order to get around the time-consuming strictures and bureaucratic processes that IT imposes on a classic EDW platform. Readers… think of the term “data governance”… governance implies bureaucracy. And a “single version of the truth” implies a monopoly (governed by IT). We need a market for ideas to support our business intelligence… and a market is a little chaotic.
What we need is a place where IT says to the business… we cannot get you integrated into our formal EDW infrastructure as fast as you would like… but don’t go and build your own warehouse/mart on your own shadow platform. Let us provide you with a mart in the cloud. Take the data you need from our EDW. Enhance it as you see fit. We can spin up a server to house the mart in the cloud in a couple of hours. Let us help you. Use the tools you want… we think that it is cool that you are going to try out some new stuff… but if you want to use the tools we provide then you’ll get the benefit of our licensing deal and the benefit of our support… but you decide. We need IT to allow a little chaos…
This, I believe, is what cloud offers to the data warehouse space… the platform to respond.
But there is a rub… data warehouse appliances from Teradata, Exadata, and Netezza require bundled hardware that is not going to fit in your cloud. A shared-nothing architecture is a tough fit for the shared-disk paradigm of the cloud (see here). The I/O reliance of a disk-based DBMS makes performance tough on a shared-disk platform. I think that for data marts and analytic sandboxes the cloud is the right choice… if you want to minimize the size of the shadow IT cast by lines of business. An in-memory database (IMDB) such as HANA, TimesTen, or SQLFire may be the best alternative for a small cloud-based mart.
David Linthicum has it right in spades for the data warehouse space… we need some user pull-through… and we need cloud computing as the platform to make these user-driven initiatives manageable.
- Shared-nothing is required for an EDW,
- An EDW is not usually under-utilized,
- There are difficulties in re-distributing sharded, shared-nothing data to provide elasticity, and
- A SAN cannot provide the same IO bandwidth per server as JBOD… nor hit the same price/performance targets.
Note that these issues are tied together. We might be able to spread the EDW workload over so many shards and so many SANs that the amount of I/O bandwidth per GB of EDW data is equal to or greater than that provided on a DW Appliance. This introduces other problems as there are typically overhead issues with a great many nodes. But it could work.
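To make the trade-off concrete, here is a minimal sketch of the arithmetic. All of the figures are hypothetical and chosen only for illustration: a 10TB warehouse, a target of 0.5 MB/s of scan bandwidth per GB of data, 500 MB/s per JBOD-backed node, and half that per SAN-backed node. The function name `nodes_needed` is made up for this example.

```python
import math

def nodes_needed(data_gb, target_mb_s_per_gb, node_mb_s):
    """Nodes required so that aggregate I/O bandwidth per GB meets the target."""
    required_aggregate_mb_s = data_gb * target_mb_s_per_gb
    return math.ceil(required_aggregate_mb_s / node_mb_s)

# Hypothetical figures, not measurements.
DATA_GB = 10_000            # 10TB warehouse
TARGET_MB_S_PER_GB = 0.5    # bandwidth-per-GB target set by the appliance
JBOD_NODE_MB_S = 500        # assumed per-node bandwidth on local disk
SAN_NODE_MB_S = 250         # assumed per-node bandwidth on shared SAN

jbod_nodes = nodes_needed(DATA_GB, TARGET_MB_S_PER_GB, JBOD_NODE_MB_S)
san_nodes = nodes_needed(DATA_GB, TARGET_MB_S_PER_GB, SAN_NODE_MB_S)
print(f"JBOD nodes: {jbod_nodes}, SAN nodes: {san_nodes}")
```

Under these assumptions the SAN-backed cluster needs twice as many nodes to hit the same bandwidth per GB, which is where the many-node overhead comes in.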
But what if we changed the architecture so that I/O was not the bottleneck? What if we built a cloud-based shared-nothing in-memory database (IMDB)? Now the data could live on SAN as it would only be read at start-up and written at shut-down… so the issues with the disk subsystem disappear… and issues around sharing the SAN disappear. Further, elasticity becomes feasible. With an IMDB we can add and delete nodes and re-distribute data without disk I/O… in fact it is likely that a column store IMDB could move column-compressed data without re-building rows. IMDB changes the game by removing the expense associated with disk I/O.
There is evidence emerging that IMDB technology is going to change the playing field (see here).
Right now there are only a few IMDB products ready in the market:
- TimesTen: which is not shared-nothing scalable, nor columnar, but could be the platform for a very small, 400GB or less (see here), cloud-based EDW;
- SQLFire: which is semi-shared-nothing scalable (no joins across shards), not columnar, but could be the platform for a larger, maybe 5TB, specialized EDW;
- ParAccel: which is shared-nothing scalable, columnar, but not fully an IMDB… but could be (see C. Monash here); or
- SAP HANA: which is shared-nothing, IMDB, columnar and scalable to 100TB (see here).
So it is early… but soon enough we should see real EDWs in the cloud and likely on Amazon EC2, based on in-memory database technologies.
In Part 1 of this topic (here) I suggested that cloud computing has the ability to be elastic… to expand and maybe contract the infrastructure as CPU, memory, or storage requirements change. I also suggested that the workload on an EDW is intense and static, to point out that there was no significant advantage to consolidating non-database workloads onto an over-utilized EDW platform.
But EDW workload does flex some with the business cycle… quarter end reporting is additive to the regular daily workload. So maybe an elastic stretch to add resources and then a contraction has value? It most probably does add value.
Shared-nothing works because it builds on a sharded model that splits the data across nodes and lets the CPU and I/O bandwidth scale together. This is very important… the limiting factor in these days of multi-core CPUs is I/O bandwidth, and many nodes plus shards provide the aggregate I/O bandwidth of all disk controllers in the cluster.
What does that mean with regard to building an elastic data warehouse? It means that with each elastic stretch the data has to be re-deployed across the new number of shards. And because the data to be moved is embedded in blocks it means that the entire warehouse, every block, has to be scanned and re-written. This is an expensive undertaking on disk… one that bottlenecks at the disk controller and one that bottlenecks worse if there are fewer controllers (for example in a SAN environment). Then, when the configuration is to shrink, the process is repeated. In reality the cost of the I/O resources to expand and contract does not justify the benefit.
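A small sketch shows why a stretch touches so much of the data. It assumes rows are placed by a simple modulo of an integer key, which is an illustrative stand-in and not the distribution scheme of any particular product; `shard_for` and `fraction_moved` are hypothetical names.

```python
def shard_for(key, shard_count):
    """Assumed placement rule: integer key modulo the number of shards."""
    return key % shard_count

def fraction_moved(keys, old_count, new_count):
    """Fraction of rows whose shard changes when the shard count changes."""
    moved = sum(
        1 for k in keys
        if shard_for(k, old_count) != shard_for(k, new_count)
    )
    return moved / len(keys)

keys = range(2000)                       # stand-in for row keys
stretch = fraction_moved(keys, 4, 5)     # add one node to a 4-node cluster
shrink = fraction_moved(keys, 5, 4)      # remove it again
print(f"stretch moves {stretch:.0%}, shrink moves {shrink:.0%}")
```

With this placement rule, adding a single node to a four-node cluster relocates most of the rows, and shrinking back relocates them again. Since the rows that move are scattered through the blocks, every block still has to be read and re-written either way.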
So… we conclude that while it is technically possible to build an elastic EDW it is not really optimal. In every case it is feasible to build a cloud-based EDW… it is possible to deploy a shared-nothing architecture, possible to consolidate workloads, and possible to expand and contract… but it is sub-optimal.
The real measure of this is that in no case would a cloud-based EDW proof-of-concept win business over a stand-alone cluster. The price of the cloudy EDW would be 2X for 1/2 the performance… and it is unlikely that the savings associated with cloud computing could make up this difference (the price of SAN is 2X that of JBOD and the aggregate I/O bandwidth is 1/2… for the same number of servers… hence the rough estimates). This is why EMC offers a Data Computing Appliance without a SAN. Further, this 4X advantage assumes that 100% of the SAN-based cluster is dedicated to the EDW. If 50% of the cluster is shared with some other workload then the performance drops by that 50%.
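The rough estimate can be written out as a short calculation. This is only a sketch of the arithmetic in the paragraph above, using its 2X price and 1/2 bandwidth figures; the function name and the `dedicated_share` parameter are introduced here for illustration.

```python
def price_performance_penalty(price_ratio, bandwidth_ratio, dedicated_share=1.0):
    """How many times worse price/performance is versus the stand-alone cluster.

    price_ratio: SAN price relative to JBOD (2.0 in the rough estimate)
    bandwidth_ratio: SAN aggregate I/O bandwidth relative to JBOD (0.5)
    dedicated_share: fraction of the shared cluster dedicated to the EDW
    """
    effective_performance = bandwidth_ratio * dedicated_share
    return price_ratio / effective_performance

fully_dedicated = price_performance_penalty(2.0, 0.5)
half_shared = price_performance_penalty(2.0, 0.5, dedicated_share=0.5)
print(f"dedicated: {fully_dedicated:.0f}X, half shared: {half_shared:.0f}X")
```

With the cluster fully dedicated the gap is the 4X quoted above; if half the cluster goes to another workload, the same arithmetic gives 8X.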
My apologies… I was playing with the iPad version of WordPress and accidentally published a very rough outline/first draft of this post. I immediately un-published it… but not before subscribers were notified that there was a new post.
I wonder whether data warehousing is really suited to operate in the cloud. This was prompted by ParAccel’s venture to deploy on the Amazon EC2 cloud infrastructure. Let’s work through the architectural implications…
Here are the assumptions I’ll take into this exploration:
- A shared-nothing architecture is required to scale.
- Cloud infrastructure is cost-effective when the infrastructure is under-utilized and workloads can be consolidated to achieve full utilization… and not so cost-effective when the infrastructure is highly utilized. This is because applications can easily share underutilized resources in the Cloud.
- Cloud infrastructure is justified when the workload is inconsistent and either CPU or storage requirements fluctuate widely over the business cycle. This is because a Cloud is elastic and can easily flex as the requirements fluctuate. Cloud computing may not be well suited to static workload requirements.
You can probably see where I’m going with this from the assumptions.
In the end I’ll suggest that there is a database architecture that is suited to warehousing and cloud computing… but let me build to that.
Before I start let me also be clear that I am talking about the database infrastructure… not the application/BI infrastructure required for data warehousing. The BI and ETL components are perfectly suited to cloud computing… they reflect a workload that, in general, runs on under-utilized hardware with BI running during the day and ETL running at night. I have suggested this to my current employer… but alas, I am neither King nor a member of Court.
So in Part 1 let me discuss my first two assumptions and the implications… In Part 2 I’ll discuss data warehousing and elasticity… In Part 3 I’ll consider the Paraccel/Amazon collaboration and in Part 4 I’ll wrap up and consider several new things coming that may change the equations.
I’ll not work too hard to justify my first assumption… I think that it is well-understood that a shared-nothing architecture provides the best possible approach to scale out. Google and others use this approach to scale to hundreds of petabytes of data and Teradata, Greenplum, Netezza, Paraccel, SAP HANA, and others use it in the data warehouse space. Exadata uses a hybrid approach that scales I/O in a shared-nothing-like storage subsystem… but fails to scale as it passes data to the RAC layer (see Kevin Closson here on the subject).
But the implications are significant for our cloud discussion. First, cloud infrastructure is designed to support general client-server or web-server based commercial computing requirements. A shared-nothing database cluster is a specialized infrastructure optimized for database processing. Implementing the specialized problem on the generalized infrastructure is possible, but sub-optimal. Next, cloud computing requires, more or less, a shared storage subsystem. A shared-nothing architecture shares nothing. Implementing a shared-nothing database on a shared storage subsystem is possible, but sub-optimal.
I believe that the second assumption is also pretty straightforward. The primary rationale for cloud computing comes from the recognition that many data centers deployed applications on servers that were not fully utilized. By virtualizing the hardware on a cloud platform the data center could better service the applications with fewer hardware resources and therefore less cost.
So… in order for cloud computing to be a perfect fit we need to observe a data warehouse database workload with underutilized hardware infrastructure… You might ask yourself… are there underutilized hardware resources upon which my EDW is built? In most cases I believe that the answer to this question will be “no”. Almost every EDW I’ve seen is over-burdened… stretched… with users demanding more and more resource… more data, more users, more queries, deeper queries drive the resource requirements up exponentially. The database is swamped all day with queries and swamped all night by ETL and reporting tasks.
So let’s end this blog concluding that there is a problematic architectural mismatch between a shared cloud and a shared-nothing implementation… and that if your warehouse database platform is highly utilized then there may be little benefit from implementing a warehouse in the cloud.
See Part 2 here…