Who is Massively Parallel? HANA vs. Teradata and (maybe) Oracle

I have promised not to promote HANA heavily on this site… and I will keep that promise. But I want to share something with you about the HANA architecture that is not part of the normal in-memory database (IMDB) marketing message: HANA is parallel from its foundation.

What I mean by that is that when a query is executed, HANA dynamically shards the data in memory and lets each core start a thread to work on its shard.
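To make the idea concrete, here is a minimal sketch of "shard the in-memory data, then give each core a thread for its shard". It is an illustration only, not HANA's implementation: the function names `shard` and `parallel_sum` are made up for this example, and a plain Python list stands in for a column held in memory.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def shard(values, n_shards):
    """Split values into n_shards contiguous, near-equal slices."""
    size, extra = divmod(len(values), n_shards)
    shards, start = [], 0
    for i in range(n_shards):
        # The first `extra` shards take one additional element each.
        end = start + size + (1 if i < extra else 0)
        shards.append(values[start:end])
        start = end
    return shards


def parallel_sum(values, n_workers=None):
    """Sum values by giving each worker thread its own shard."""
    n_workers = n_workers or os.cpu_count() or 1
    shards = shard(values, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # Each thread aggregates its shard independently...
        partials = list(pool.map(sum, shards))
    # ...and the partial results are combined in one final step.
    return sum(partials)


if __name__ == "__main__":
    print(parallel_sum(list(range(1000)), n_workers=4))
```

In standard CPython the global interpreter lock keeps these threads from running pure-Python code truly in parallel, so read this as a sketch of the structure (shard, work independently, combine) rather than a benchmark.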

Other shared-nothing implementations like Teradata and Greenplum, which are not built on a native parallel architecture, start multiple instances of the database to take advantage of multiple cores. If they can start an instance per core, then they approximate the parallelism of a native implementation… at the cost of inter-instance communication. Oracle, to my knowledge, does not parallelize steps within a single instance… I could be wrong there, so I’ll ask my readers to help.

As you would expect, for analytics and complex queries this architecture provides a distinct advantage. HANA customers are optimizing price models in real time, in under a second, with each quote instead of executing a once-a-week, 12-hour modeling job.

June 11, 2013: You can find a more complete and up-to-date discussion of this topic here… – Rob

As you would expect, HANA cannot yet stretch into the petabyte range. The current HANA sweet spot for warehouses or marts is in the sub-TB to 20TB range.

5 thoughts on “Who is Massively Parallel? HANA vs. Teradata and (maybe) Oracle”

  1. Pingback: Who is Massively Parallel? HANA vs. Teradata and (maybe) Oracle … » BlinkMoth Software Industries | BlinkMoth Software Industries

  2. By saying that “HANA dynamically shards the data in-memory and lets each core start a thread to work on its shard”, do you mean that query execution is a two-step process, one step for sharding and the other for executing? And how is the sharding of data performed? Is it correct to define it as a parallel step?

    • Remember Daniele… It’s all in memory… So ‘sharding’ is nothing more than dividing the memory blocks between the cores so that each core can execute independently on a chunk/shard. This also allows the cores to maintain data in their Ln cache structure without thrashing.

  3. Rob, not being a techie, can you explain in layman’s terms (if possible) why HANA cannot scale into the petabyte range? What restricts the system to top out at 20TB? Is this structural, or can that ceiling be raised over time?

    • Hi Vikas,

      What I tried to suggest is that the current HANA release has a sweet spot in the 20TB range… The product roadmap will stretch this out with each subsequent release. That is, there is no architectural limit. HANA can scale big. If you configured 1TB of RAM per node, which provides enough memory to hold 5TB of compressed user data (5TB is the conservative number… 15TB is possible)… then you would need 200 nodes to fit 1PB.
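      The node count above is simple division, and it can be sketched as follows. This is only a back-of-the-envelope helper: the function name `nodes_needed` is made up for illustration, and it assumes 1PB = 1000TB, which is what the 200-node figure implies.

```python
import math


def nodes_needed(total_user_tb, user_tb_per_node):
    """Nodes required to hold total_user_tb, rounding up to whole nodes."""
    return math.ceil(total_user_tb / user_tb_per_node)


if __name__ == "__main__":
    # Conservative case: 5TB of user data per 1TB-RAM node.
    print(nodes_needed(1000, 5))
    # Optimistic case: 15TB of user data per 1TB-RAM node.
    print(nodes_needed(1000, 15))
```

      With the conservative 5TB per node this gives the 200 nodes quoted above; at the optimistic 15TB per node the same 1PB would need 67 nodes.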

      All of the in-memory databases (IMDBs), HANA, TimesTen, GemFire, etc., are betting that memory will become less expensive and bigger in the near term and that the amount of data that will fit on a node will go up.

      I have a blog post drafted that compares these… I’ll try to get it posted… but the bottom line is that on a single node without column compression a DBMS could fit roughly 1TB of user data on a 1TB server (1TB of user data compressed 2X into 500GB of memory, plus 500GB of memory for work space and for software). If the DBMS is not designed to support multiple nodes in a shared-nothing cluster… then a single node, 1TB, is the limit (for TimesTen & GemFire).
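      The single-node arithmetic above can be written out as a rough sketch. The function name `user_data_capacity_tb` and its parameters are hypothetical, chosen only to mirror the numbers in the paragraph: half of the RAM reserved for work space and software, and 2X compression on the rest.

```python
def user_data_capacity_tb(ram_tb, workspace_fraction, compression_ratio):
    """User data (TB) that fits once work space is set aside and data is compressed."""
    memory_for_data_tb = ram_tb * (1 - workspace_fraction)
    return memory_for_data_tb * compression_ratio


if __name__ == "__main__":
    # 1TB server, 500GB kept for work space and software, 2X compression.
    print(user_data_capacity_tb(1.0, 0.5, 2.0))
```

      Under those assumptions a 1TB server holds about 1TB of user data, which is the single-node limit described above.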

      For now, HANA has an advantage in that it supports columnar compression and shared-nothing… so it fits more on 1 node and supports many nodes…
