This post has been thrown at me a couple of times now… so I’ll now take the time to go through it… and try to address the junk.
It starts by suggesting that “the Germans” have started a war… but the next sentence points out that the author tossed grenades at HANA two months before the start he suggests. It also ignores the fact that the HANA post in question was a response to incorrect public statements by a Microsoft product manager about HANA (here).
The author suggests some issue with understanding clustered indexes… Note that “There are 2 implementations of xVelocity columnstore technology: 1. Non clustered index which is read only – this is the version available in SMP (single node) SQL Server 2012. 2. Columnstore as a clustered index that is updateable – This is the version available in MPP or PDW version of SQL 2012.”. The Microsoft documentation I read did not distinguish between the two and so I mistakenly attributed features of one to the other. Hopefully this clears up the confusion.
He suggests that the concept of keeping redundant versions of the data… one for OLTP and one for BI is “untrue”… I believe that the conventional way to deal with OLTP and BI is to build separate OLTP and BI databases… data warehouses and data marts. So I stand by the original comment.
The author rightly points out that I did not provide a reference for my claim that there are odd limitations to the SQL that require hand-coding… here they are (see the do’s and don’ts).
He criticizes my statement that shared-nothing gave us the basis for solving “big data”. I do not understand the criticism. Nearly every very large database in the world is based on a shared-nothing architecture… and the SQL Server PDW is based on the same architecture in order to allow SQL Server to scale.
He is critical of the fact that HANA is optimized for the hardware and suggests that HANA does not support Intel’s Ivy Bridge. HANA is optimized for Ivy Bridge… and HANA is designed to fully utilize the hardware… If we keep it simple and suggest that using hardware-specific instruction sets and hardware-specific techniques to keep data in cache together provide a 50X performance boost [This ignores the advantages of in-memory and focusses only on hw-specific optimizations... where data in cache is either 15X (L3) or 20X (L2) or 200X (L1) faster than data fetched from DRAM... plus 10X or more using super-computer SIMD instructions], I would ask… would you spend 50X more for under-utilized hardware if you had a choice? SAP is pursuing a distinct strategy that deserves a more thoughtful response than the author provided.
He accuses me of lying… lying… about SQL Server being architected for single-core 286 processors. Sigh. I am unaware of a rewrite of the SQL Server product since the 286… and tacking on support for modern processors is not re-architecting. If SQL Server was re-architected from scratch since then I would be happy to know that I was mistaken… but until I hear about a rewrite I will assume the SQL Server architecture, the architecture, is unchanged from when Sybase originally developed it and licensed it to Microsoft.
He says that HANA is cobbled together from older piece parts… and points to a Wikipedia page. But he does not use the words in the article… that HANA was synthesized from other products and, as stated in the next sentence, built on “a new application architecture”. So he leaves the reader to believe that there is nothing new… he is mistaken. HANA is more than a synthesis of in-memory, column-store, and shared-nothing. It includes a new execution engine built on algorithms from the search space… columns in the column store are processed as vectors rather than with the rote tuple-by-tuple approach from the 1980s. It includes powerful in-database support for procedural languages with facilities that convert loops to fully parallel set-based processes. It provides, as noted above, a unique approach to supporting OLTP and BI queries in the same instance (see here)… and more. I’m not trying to hype HANA here… time and the market will determine if these new features are important… but there is no doubt that they are new.
I did not find the Business Intelligist post to be very informative or helpful. With the exception of the Wikipedia article mentioned above there is only unsubstantiated opinion in the piece… and a degree of rudeness that is wholly uncalled for.
This is a rehash of my post for SAP here… I thought you might find it interesting as it describes the architecture HANA uses to support OLTP and BI against a single table.
A couple of points to think about:
- If you have only one database structure you can optimize for only one kind of query; e.g. the OLTP query is fast against an OLTP structure but slow against a BI structure… or vice versa.
- If you have two structures you have to ETL the data between the two at some cost. There is cost in keeping a replica of the data, cost in developing, administering, and executing the ETL process. In addition there is a lost opportunity cost hidden in the latency of the data. You cannot see the current state of the business by querying the BI data as some data has not yet been ETL’d across.
- OLTP performance is normally paramount; so the perfect system would not compromise that performance or compromise it only a little.
Let’s look at the HANA approach to this at a high level.
HANA provides a single view of a table to an application or a user, but under the covers each table includes an OLTP-optimized part, a BI-optimized part, and a mechanism for moving data from one part to the other.
When a transaction hits the system, inserts, updates, and deletes are processed in the OLTP part with no performance penalty. The read portion of the OLTP query accesses the read-optimized internal structure with no performance penalty. Note that reading a single column in a column store, which is the key for the transaction, is roughly equivalent to reading an index structure on top of a standard disk-based DBMS, except that the column is always in memory, which means I/O is never required. This provides the HANA system with an advantage over a disk-based system. Disk I/O is 120+ times slower than memory access so even an index is unlikely to beat in-memory. See here for some numbers you should know.
After the transaction is committed into the internal, OLTP-optimized part, a process starts that moves the data to the BI optimized part. This is called a delta merge as the OLTP portion holds all of the changes, the delta, in the data set.
When a BI query starts it can limit the scan to only partitions in the BI optimized part, or if real-time data is required it can scan both parts. The small portion of the scan that accesses the OLTP/delta portion is sub-optimal when compared to the scan of the BI part, but not slow at all as the data is all in-memory.
We can tease the performance apart as follows:
- There is an OLTP insert/update/delete “write” portion… and HANA executes this like any OLTP database, as fast as an OLTP RDBMS, with a commit after a write-to-log;
- There is an OLTP select “read” portion… and HANA performs this in the in-memory column store faster than many OLTP databases… and scans the delta structure as fast as any OLTP database;
- There is a delta merge from the OLTP write-optimized part to the BI read-optimized column store that is hundreds to tens of thousands of times faster than any ETL tool; and
- There is a BI select portion that scans the in-memory column store hundreds to thousands of times faster than a disk-based BI database.
- If the BI query requires access to real-time data then an in-memory scan of the delta file is required… there is no analogy to this in a system with separate OLTP and BI tables.
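The split between a write-optimized delta and a read-optimized column store can be sketched in a few lines of Python. This is a toy model under my own assumptions, not HANA's implementation: the class `ToyTable` and its methods `insert`, `delta_merge`, and `sum_column` are hypothetical names invented for illustration, and logging, concurrency, updates, and deletes are all left out.

```python
class ToyTable:
    """Hypothetical toy model: a write-optimized delta plus a read-optimized column store."""

    def __init__(self, columns):
        self.columns = list(columns)
        # BI-optimized part: one list of values per column (a column store).
        self.main = {name: [] for name in self.columns}
        # OLTP-optimized part: appended rows that have not been merged yet (the delta).
        self.delta = []

    def insert(self, row):
        # Writes only touch the delta, so the write path stays cheap.
        self.delta.append(dict(row))

    def delta_merge(self):
        # Move the accumulated rows from the delta into the column store.
        for row in self.delta:
            for name in self.columns:
                self.main[name].append(row[name])
        self.delta = []

    def sum_column(self, name, real_time=True):
        # A BI scan reads the column store, plus the delta when fresh data is required.
        total = sum(self.main[name])
        if real_time:
            total += sum(row[name] for row in self.delta)
        return total


table = ToyTable(["order_id", "amount"])
table.insert({"order_id": 1, "amount": 10})
table.insert({"order_id": 2, "amount": 5})
table.delta_merge()                          # rows 1 and 2 now live in the column store
table.insert({"order_id": 3, "amount": 7})   # row 3 is still in the delta
print(table.sum_column("amount", real_time=False))
print(table.sum_column("amount", real_time=True))
```

The point of the sketch is the last two lines: the same table answers a query over merged data only, or over merged data plus the not-yet-merged delta, without any ETL step in between.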
There is still an open question over whether, after the Big Bang, there is enough mass in the Universe to slow the expansion and cause the universe to contract. While the Big Data Bang continues to expand the universe of bits and bytes… I would like to ask whether some of these numbers are overstated? I know that the sum of the bits and bytes is expanding but I wonder if the universe of information is expanding as much as we claim?
Note that by “information” I mean a unique combination of bits and bytes representing some new information. In other words, if the same information is copied redundantly over and over does that count?
There is a significant growth industry in deduplication software that can backup data without copying redundant information. The savings from these products is astounding. NetApp claims 70% of the unstructured data may be redundant (see here). Data Domain says that eliminating (and compressing) redundant data reduces storage requirements by 10X-30X (see here). What’s up with that?
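To make the idea concrete, here is a minimal sketch of how deduplication can measure redundancy by hashing fixed-size chunks and storing each distinct chunk once. It is an illustration under my own assumptions, not how NetApp or Data Domain do it: the function name `dedup_stats` and the 4096-byte chunk size are hypothetical, and real products typically add variable-size chunking and compression.

```python
import hashlib


def dedup_stats(blobs, chunk_size=4096):
    """Return (raw_bytes, unique_bytes) when identical fixed-size chunks are stored once."""
    seen = set()
    raw_bytes = 0
    unique_bytes = 0
    for blob in blobs:
        for start in range(0, len(blob), chunk_size):
            chunk = blob[start:start + chunk_size]
            raw_bytes += len(chunk)
            # Identical chunks produce identical digests, so a copy costs nothing extra.
            digest = hashlib.sha256(chunk).digest()
            if digest not in seen:
                seen.add(digest)
                unique_bytes += len(chunk)
    return raw_bytes, unique_bytes


# Ten identical copies of the same two-chunk snapshot.
snapshot = b"a" * 4096 + b"b" * 4096
raw, unique = dedup_stats([snapshot] * 10)
print(raw, unique, raw / unique)
```

Ten copies of the same snapshot occupy ten times the raw bytes but add no new information, which is exactly the distinction between bits and information drawn above.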
In the data warehouse space it is just as bad. The same data lives in OLTP systems, ETL staging areas, Operational Data Stores, Enterprise Data Warehouses, Data Marts, and now Hadoop clusters. The same information is replicated in aggregate tables, indexes, materialized views, and cubes. If you go into many shops you can find 50TB of EDW data exploded into 500TB of sandboxes for the data scientists to play with. Data is stored in snapshots on an hourly basis where less than 10% of the data changes from hour to hour. There is redundancy everywhere. There is redundancy everywhere.
I believe that there is a data explosion… and I believe that it is significant… but there is also a sort of laziness about copying data.
Soon we will see in production the first systems where a single copy of OLTP and EDW and analytic data can reside in the same platform and be shared. It will be sort of shocking to see the Big Data Bang slow a little…
So far this blog has focused on issues related to database architecture… so this title might not seem on message. But architecture has implications.
The aim of any BI system is to support the decision-making process of the business. BI infrastructure is clearly a success when your company learns to make fact-based decisions as part of the day-to-day operation of the business. The best data warehouse in the world would be one that provides such effective decision support that the business gains a competitive advantage over the competition.
But I often run into companies where sweet success has turned sour. Why? Because in these sour situations the BI eco-system cannot keep up. In these bad cases the best data warehouse in the world becomes the worst.
Usually the problem comes in one of two flavors: either the required decision support is unavailable in time to make a decision, or the eco-system cannot extend to support new business opportunities.
The first case usually shows up during periods when decision-making increases: during seasonal peaks in business. The second appears when the business grows: after a merger or when a new product is introduced. In both cases the cost of the failure is significant.
But these worst cases do not happen out of the blue. They creep up on you. There are symptoms. Often the first symptom is when the nightly reporting process starts missing its service level targets. That is, the nightly load of the warehouse and the refresh of the indexes, materialized views, the summary tables, the cubes, and the marts; and then the running of reports cannot complete in the batch window. This is followed by slow response in your online query processing as the nightly process creeps into the day. Then, the business asks for more users and/or for more data to be added and the problem grows… until decision-making is delayed or unsupported altogether.
Sadly, this problem is avoidable and the solution is well understood. All that is required is a scalable foundation that can extend through the addition of relatively inexpensive hardware. If you could easily add storage and compute then as the constraints hit you can scale up.
A shared nothing architecture scales. We have examples at Greenplum of production systems that scale from hundreds of gigabytes to thousands of terabytes… and other shared nothing vendors, Teradata and Netezza at least, can boast the same. When our customers run out of gas we add hardware. And the architecture scales bigger still… shared nothing is the foundation for all web scale database technology… scaling to hundreds of petabytes.
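Why does adding hardware work? A minimal sketch, assuming rows are spread across nodes by hashing a distribution key: each node owns its own slice of the data, does its own share of the scan, and only small partial results are combined. The function names `node_for`, `distribute`, and `parallel_sum` are hypothetical, the "nodes" are just Python lists, and this is not the implementation of Greenplum, Teradata, or Netezza.

```python
import zlib


def node_for(key, node_count):
    # Stable hash so the same key always lands on the same node.
    return zlib.crc32(str(key).encode("utf-8")) % node_count


def distribute(rows, key, node_count):
    # Each node gets its own private slice of the rows; nothing is shared.
    nodes = [[] for _ in range(node_count)]
    for row in rows:
        nodes[node_for(row[key], node_count)].append(row)
    return nodes


def parallel_sum(nodes, column):
    # Each node sums its own rows; only the small partial results are combined.
    partials = [sum(row[column] for row in node) for node in nodes]
    return sum(partials)


rows = [{"id": i, "amount": i} for i in range(100)]
four_nodes = distribute(rows, "id", 4)
eight_nodes = distribute(rows, "id", 8)   # "adding hardware": same data, more nodes
print(parallel_sum(four_nodes, "amount"), parallel_sum(eight_nodes, "amount"))
```

Doubling the node count leaves the answer unchanged while each node scans a smaller slice on average, which is the property that lets this kind of system grow by adding relatively inexpensive hardware.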
So why do companies build, and continue to build, on shared memory systems with built-in limits? Because… they continually underestimate the growth in data… the failure is a failure of vision (consider the name “Teradata”… selected when a terabyte was considered nearly unreachable). Data does not just grow, it explodes in leaps and bounds as technology advances.
But let’s be real… Why do companies really select limiting infrastructure? Because they mistakenly believe that they can build BI infrastructure on technology designed for OLTP… and they already have DBAs trained on this technology who heavily influence the decision. Or, they have an enterprise license for the OLTP database and they want to save some money.
I imagine that I’ve made my point. The worst data warehouse in the world is a warehouse that constrains your business… one that cannot scale as the demand for data and decision support grows… one that costs you hundreds of thousands of dollars in staff time with every change… one that is tuned to the breaking point, rather than robust.
Why would anyone ever put their business at risk like this?