There is a persistent myth, like a persistent cough, that claims that in-memory databases lose data when a hardware failure takes down a node because memory is volatile and non-persistent. This myth is marketing, not architecture.
- A write transaction arrives
- The transaction is written to the log file and committed… this is a very quick process with 1 sequential I/O… quicker still if the log file is on a SSD device
- The query updates the in-memory layer; and
- After some time passes, saves the in-memory data to disk.
Recovery for these databases is easy to understand:
- If a hardware failure occurs and #1 but before #2 the transaction has not been committed and is lost.
- If a hardware failure occurs after #2 but before #3 the transaction is committed and the database is rebuilt when the node restarts from the log file.
- If a hardware failure occurs after #3 but before #4 the same process occurs… the database is rebuilt when the node restarts from the log file.
- If a hardware failure occurs after #4 the database is rebuilt from the disk copy.
SQLFire uses a different approach (from here):
“Unlike traditional distributed databases, SQLFire does not use write-ahead logging for transaction recovery in case the commit fails during replication or redundant updates to one or more members. The most likely failure scenario is one where the member is unhealthy and gets forced out of the distributed system, guaranteeing the consistency of the data. When the failed member comes back online, it automatically recovers the replicated/redundant data set and establishes coherency with the other members. If all copies of some data go down before the commit is issued, then this condition is detected using the group membership system, and the transaction is rolled back automatically on all members.”
Redundant in-memory data optimizes transaction throughput but requires twice the memory. There are options to persist data to disk… but these options provide an approach that is significantly slower than the write-ahead logging used by TimesTen and HANA (and Oracle and Postgres, and …).
The bottom line: IMDBs are designed in the same manner as other, disk-based, DBMSs. They guarantee that comitted data is safe… everytime.
See here for how these DBMSs compare when a BI/analytic workload is applied.
As you may have noticed I’m looking at in-memory databases (IMDB) these days… Here are some quick architectural observations on VMWare‘s SQLFire, Oracle’s Exalytics and TimesTen offerings, and SAP HANA.
It is worth noting up front that I am looking to see how these products might be used to build a generalized data mart or a data warehouse… In other words I am not looking to compare them for special case applications. This is important because each of these products has some extremely cool features that allow them to be applied to application-specific purposes with a narrow scope of data and queries… maybe in a later blog I can try to look at some narrow use-cases.
Further, to make this quick blog tractable I am going to assume that the mart/dw problem to be solved requires more data than can fit on one server node… and I am going to ignore features that let queries access data that resides on disk… in-memory or bust.
Finally I will assume that the SQL dialect supported is sufficient and not drill into details there. I will look at architecture not SQL features…
Simply put I am going to look at a three characteristics:
- Will the architecture support ad hoc queries?
- Does the architecture support scale-out?
- Can we say anything with regards to price/performance expectations?
Exalytics is a smart-aggregate store that sits over an Oracle database to offload aggregate query workload (see my previous post here or the Rittman Mead post here which declares: “Oracle Exalytics uses a specially enhanced version of Oracle TimesTen, Oracle’s in-memory database, to cache commonly used aggregates used in dashboards, analyses and other BI objects.” Exalytics does not support a scale-out shared-nothing architecture but it can scale up by adding nodes with new aggregate data. Queries access data within the aggregate structure and it is not possible to join to data off the Exalytics node… so ad hoc is out. Within these limits, which preclude Exalytics from being considered as a general platform for a mart or warehouse, Exalytics provides dictionary-based compression which should provide around 5X compression to reduce the amount of memory required and reduce the amount of hardware required.
TimesTen can do more. It is a general RDBMS. But it was designed for OLTP. I assume that the reason that Oracle has not rolled it out as a general-purpose data mart or data warehouse has to do with constraints that grow from those OLTP architectural roots. For example, BI queries run longer and require more data than a OLTP query… and even with data in-memory temporary storage is required for each query… and memory utilization is a product of the amount of data required and the amount of time the data has to inhabit memory… so BI queries put far more pressure on an in-memory DBMS. There are techniques to mitigate this… but you have to build the techniques in from the ground up.
I imagine that this is why TimesTen works for Exalytics, though. A OLAP query against a pre-aggregated cube does not graze an entire mart or warehouse. It is contained and “small data” (for my wacky take re: Exalytics and Exadata see here).
TimeTen is not sharded… so scalability is an issue. Oracle gets around this nicely by allowing you to partition data across instances and have the application route queries to the appropriate server. But this approach will not support joins across partitions so it severely limits scalability in a general-purpose mart or warehouse.
SQLFire is a very interesting new product built on top of Gemfire… and therefore mature from the start. SQLFire is more scalable than TimesTen/Exalytics. It supports sharded data in a cluster of servers. But SQLFire has the limitation that it cannot join data across shards (they call them partitions… see here) so it will be hard to support ad hoc queries… They provide the ability to replicate tables to support any sort of joins. If, for example, you replicate small dimension tables to coexist with sharded fact tables all joins are supported. This solution is problematic if you have multiple fact tables which must be joined… and replication of data uses more memory… but SQLFire has the foundation in place to become BI-capable over time.
Performance in an in-memory database comes first and foremost from eliminating disk I/O. All three IMDB product provide this capability. Then performance comes from the efficient use of compression. TimeTen incorporates Oracles dictionary-based “columnar” compression (I so hate this term… it is designed to make people think that Oracle products are sort-of columnar… but so far they are not). Then performance comes from columnar projection… the ability to avoid touching all data in a row to process a query. Neither TimesTen nor SQLFire are columnar databases. Then performance comes from parallel execution. Neither TimesTen nor SQLFire can involve all cores on a single query to my knowledge.
Price comes from compression as well. The more highly compressed the data is the less memory required to store it. Further, if data can be used without decompressing it, then less working memory is required. As noted, TimesTen has a compression capability. SQLFire does not appear to compress data. Neither can use compressed data. Note that 2X compression cuts the amout of memory/hardware required in half or more… 4X cuts it to a quarter… and so on. So this is significant.
Now for some transparency… I started the research for this blog, and composed a 1st draft, last Spring while I was at EMC Greenplum. I am now at SAP working with HANA. So… I will not go into HANA at great length… but I will point out that: HANA fully supports a shared-nothing architetcture… so it is fully scalable; HANA is fully parallel and able to use all cores for each query; HANA fully supports columnar tables so it provides deep compression and the ability to use the compressed data in execution. This is not remarkable as HANA was designed from the bottom up to support both BI and OLTP workloads while TimesTen and SQLFire started from a purely OLTP architectural foundation.
David Linthicum suggests here that Shadow IT is not all a bad thing. He references a PricewaterhouseCoopers study that suggests that 30% of all IT spending comes from the business directly… from outside of the IT budget.
In the data warehouse space we can confirm these numbers easily. Just google on “data mart consolidation” to see the impact of the business building their own BI infrastructure in order to get around the time-consuming strictures and bureaucratic processes that IT imposes on a classic EDW platform. Readers… think of the term “data governance”… governance implies bureaucracy. And a “single version of the truth” implies a monopoly (governed by IT). We need a market for ideas to support our business intelligence… and a market is a little chaotic.
What we need is a place where IT says to the business… we cannot get you integrated into our formal EDW infrastructure as fast as you would like… but don’t go and build your own warehouse/mart on your own shadow platform. Let us provide you with a mart in the cloud. Take the data you need from our EDW. Enhance it as you see fit. We can spin up a server to house the mart in the cloud in a couple of hours. Let us help you. Use the tools you want… we think that it is cool that you are going to try out some new stuff… but if you want to use the tools we provide then you’ll get the benefit of our licensing deal and the benefit of our support… but you decide. We need IT to allow a little chaos…
This, I believe is what cloud offers to the data warehouse space…. the platform to respond.
But there is a rub… data warehouse appliances from Teradata, Exadata, and Netezza require bundled hardware that is not going to fit in your cloud. A shared-nothing architecture is a tough fit into the shared disk paradigm of the cloud (see here). The I/O reliance of a disk-based DBMS make performance tough on a shared disk platform. I think that for data marts and analytic sandboxes the cloud is the right choice… if you want to minimize the size of the shadow IT cast by lines of business. An in-memory database (IMDB): HANA, TimesTen, or SQLFire may be the best alternative for a small cloud-based mart.
David Linthicum has it right in spades for the data warehouse space… we need some user pull-through… and we need cloud computing as the platform to make these user-driven initiatives manageable.
Numbers Everyone Should Know
|L1 cache reference||0.5 ns|
|Branch mispredict||5 ns|
|L2 cache reference||7 ns|
|Mutex lock/unlock||25 ns|
|Main memory reference||100 ns|
|Compress 1K bytes with Zippy||3,000 ns|
|Send 2K bytes over 1 Gbps network||20,000 ns|
|Read 1 MB sequentially from memory||250,000 ns|
|Round trip within same datacenter||500,000 ns|
|Disk seek||10,000,000 ns|
|Read 1 MB sequentially from disk||20,000,000 ns|
|Send packet CA->Netherlands->CA||150,000,000 ns|
Of note is the 120X difference between the cost of reading 1MB from memory and the cost of reading 1MB from disk…
The entire presentation may be found here: http://www.odbms.org/download/dean-keynote-ladis2009.pdf
Several months ago I was invited to a dinner attached to a data science summit… with the price being that I had to deliver a 5 minute talk… I had to sing for my dinner. The result was this thinking on real-time analytics and the Toyota Prius.
- Changes in the data are evaluated continuously; and
- The results of the analysis are used or displayed continuously.
- Shared-nothing is required for an EDW,
- An EDW is not usually under-utilized,
- There are difficulties in re-distributing sharded, shared-nothing data to provide elasticity, and
- A SAN cannot provide the same IO bandwidth per server as JBOD… nor hit the same price/performance targets.
Note that these issues are tied together. We might be able to spread the EDW workload over so many shards and so many SANs that the amount of I/O bandwidth per GB of EDW data is equal to or greater than that provided on a DW Appliance. This introduces other problems as there are typically overhead issues with a great many nodes. But it could work.
But what if we changed the architecture so that I/O was not the bottleneck? What if we built a cloud-based shared-nothing in-memory database (IMDB)? Now the data could live on SAN as it would only be read at start-up and written at shut-down… so the issues with the disk subsystem disappear… and issues around sharing the SAN disappear. Further, elasticity becomes feasible. With an IMDB we can add and delete nodes and re-distribute data without disk I/O… in fact it is likely that a column store IMDB could move column-compressed data without re-building rows. IMDB changes the game by removing the expense associated with disk I/O.
There is evidence emerging that IMDB technology is going to change the playing field (see here).
Right now there are only a few IMDB products ready in the market:
- TimeTen: which is not shared-nothing scalable, nor columnar, but could be the platform for a very small, 400GB or less (see here), cloud-based EDW;
- SQLFire: which is semi-shared-nothing scalable (no joins across shards), not columnar, but could be the platform for a larger, maybe 5TB, specialized EDW;
- ParAccel: which is shared-nothing scalable, columnar, but not fully an IMDB… but could be (see C. Monash here); or
- SAP HANA: which is shared-nothing, IMDB, columnar and scalable to 100TB (see here).
So it is early… but soon enough we should see real EDWs in the cloud and likely on Amazon EC2, based on in-memory database technologies.
I found myself wondering where did the rule-of-thumb for Exalytics that suggests that TimesTen can use 800GB of a 1TB memory space… and requires 400GB of that space for work tables leaving room for 400GB of user data… come from (it is quoted everywhere… here is an example… see question #13).
Sure enough, this rule has been around for a while in the TimesTen literature… in fact it predates Exalytics (see here).
Why is this important? The workspace per query for a TPC-A transaction is very small and the amount of time the memory is held by a TPC-A transaction is very short. But the workspace required by a TPC-H query is at least 10X the space required by a TPC-A query and the duration of a TPC-H query is at least 10X the duration of a TPC-A query. The result is at least 100X more pressure on memory utilization.
So… I suspect that the 600GB of user data I calculated here may be off by more than a little. Maybe Exalytics can support 300GB of user data or 100GB of user data or maybe 60GB?
As a side note… it is always important to remember that the pressure on memory is the amount of memory utilized times the duration of the utilization. This is why the data flow architecture used in modern databases like Greenplum are effective. Greenplum uses more memory per transaction but it holds the memory for less time by never (almost) writing it to disk. This is different from older database architectures like Teradata and Oracle which use disk to store intermediate results… lowering the overall amount of memory required but increasing the duration of the query. More on this here…