I would like to point you to two articles in the latest Northern California Oracle Users Group (NoCOUG) Journal here.
The first is an interview with Kevin Closson here. The interview is long and will take some time to get through… so set aside 30 minutes… it will be worth it as Kevin discusses Exadata, shared-nothingness, and other topics related to database hardware architecture.
The second article I would like to suggest (by the way there are several other excellent articles) is by Dr. Bert Scalzo. He reminds us that our job as engineers is to build the most cost-effective solution… not to build the perfect solution. He suggests that hardware should be treated as a dynamic resource that can be provisioned easily to solve performance problems.
I have argued that in a scalable, shared-nothing architecture it is often cheaper to add another $20,000 fat server than to spend $100,000 of staff time to tune around a performance problem. This is especially true when the tuning involves building indexes, materialized views, or pre-aggregated tables that make your warehouse fragile and more difficult to tune the next time. See here…
Back to Kevin’s interview and to tie the two articles together… Kevin suggests that as long as data flows into the CPUs fast enough then there is no reason to pick a shared-nothing architecture over a shared-everything architecture. He insists on symmetry and rightfully points out that a shared-everything system can be symmetrical. But it is more difficult to maintain symmetry as you scale up a shared-everything system… and easy scale is what is required to treat hardware as a dynamic resource. So… I remain convinced that shared-nothing is the way to go…
I found myself wondering where the rule-of-thumb for Exalytics came from… the one that suggests that TimesTen can use 800GB of a 1TB memory space and requires 400GB of that space for work tables, leaving room for 400GB of user data (it is quoted everywhere… here is an example… see question #13).
Sure enough, this rule has been around for a while in the TimesTen literature… in fact it predates Exalytics (see here).
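To make the arithmetic behind the rule of thumb explicit, here is a minimal sketch in Python. The figures are the ones quoted above (with 1TB treated as a round 1000GB); the function name and the idea of computing the split this way are my own illustration, not something taken from the TimesTen documentation.

```python
# Figures quoted in the rule of thumb (1 TB treated as a round 1000 GB).
TOTAL_MEMORY_GB = 1000
TIMESTEN_USABLE_GB = 800   # what TimesTen can use of the 1 TB
WORK_TABLE_GB = 400        # space the rule reserves for work tables


def user_data_capacity_gb(usable_gb: float, work_gb: float) -> float:
    """Hypothetical helper: memory left for user data after work space."""
    if work_gb > usable_gb:
        raise ValueError("work space exceeds usable memory")
    return usable_gb - work_gb


user_gb = user_data_capacity_gb(TIMESTEN_USABLE_GB, WORK_TABLE_GB)
share = user_gb / TOTAL_MEMORY_GB
print(f"user data: {user_gb} GB ({share:.0%} of total memory)")
```

Under the rule, only 40% of the physical memory ends up holding user data… and any workload that needs more work space than the rule assumes pushes that number down.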
Why is this important? The workspace per query for a TPC-A transaction is very small and the amount of time the memory is held by a TPC-A transaction is very short. But the workspace required by a TPC-H query is at least 10X the space required by a TPC-A query and the duration of a TPC-H query is at least 10X the duration of a TPC-A query. The result is at least 100X more pressure on memory utilization.
So… I suspect that the 600GB of user data I calculated here may be off by more than a little. Maybe Exalytics can support 300GB of user data or 100GB of user data or maybe 60GB?
As a side note… it is always important to remember that the pressure on memory is the amount of memory utilized times the duration of the utilization. This is why the data flow architecture used in modern databases like Greenplum is effective. Greenplum uses more memory per transaction but it holds the memory for less time by (almost) never writing intermediate results to disk. This is different from older database architectures like Teradata and Oracle, which use disk to store intermediate results… lowering the overall amount of memory required but increasing the duration of the query. More on this here…
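The definition of memory pressure in the side note (memory held times how long it is held) can be sketched in a few lines of Python. The units are relative and the class is hypothetical; the only numbers taken from the post are the two "at least 10X" factors for a TPC-H query versus a TPC-A transaction.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    """Hypothetical model of one query's memory footprint."""
    name: str
    memory_units: float    # workspace held per query (relative units)
    duration_units: float  # how long the workspace is held (relative units)

    def pressure(self) -> float:
        # Memory pressure = amount of memory utilized x duration of use.
        return self.memory_units * self.duration_units


# Baseline transaction, and a query with 10X the workspace for 10X as long.
tpc_a = Workload("TPC-A-like transaction", memory_units=1, duration_units=1)
tpc_h = Workload("TPC-H-like query", memory_units=10, duration_units=10)

ratio = tpc_h.pressure() / tpc_a.pressure()
print(f"{tpc_h.name} exerts {ratio:.0f}X the pressure of a {tpc_a.name}")
```

Because the two factors multiply, 10X the workspace held for 10X as long gives 100X the pressure… and since both factors are lower bounds, so is the 100X.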
Here is a sound bite on Big Data I composed for another source…
Big Data is relative. For some firms Big Data will be measured in petabytes and for others in hundreds of gigabytes. The point is that very detailed data provides the vital statistics that quantify the health of your business.
To store and access Big Data you need to build on a scalable platform that can grow. To process Big Data you need a fully scalable parallel computing environment.
With the necessary infrastructure in place the challenge becomes: how do you gauge your business and how do you change the decision-making processes to use the gauges?