My 2 Cents: Netezza 1Q2013

The TwinFin Surf Board

The TwinFin Surf Board (Photo credit: tvanhoosear)

Since my blogs tend to be in response to some stimulus they may not reflect a holistic view on any particular product. The “My 2 Cents” series will try to provide a broader view…

Please consider this as you read on…

Summary

Netezza put a new spin on data warehousing… they made it easy. The Netezza software includes a unique clustered index feature called a zone map that is powerful and easy to use. They also use a FPGA co-processor to augment the CPUs, offloading data compression and projection. When both of these innovations combine Netezza is hard to beat.

Zone maps are powerful when they can be used in a query plan… but the hardware is only good, not great, when zone maps are not in the plan. FPGAs provided a huge boost when Netezza first came on the scene… but as discussed here they do not provide the same boost today. In addition, FPGAs may limit the ability of a Netezza cluster to handle concurrent queries (see here and especially the comments).

The IBM acquisition has opened up a market of Blue shops to Netezza… so they are selling… and as a result Netezza is here to stay.

Where They Win

Of course, Netezza will win in all-Blue shops.

Netezza wins when there is a naturally sequenced field in each big table that is also used in the predicate for most queries. For example, if data is naturally in date/time sequence and every query has a date/time constraint then Netezza is hard to beat. This is the case most often for focussed data marts or single application databases… so look for Netezza for these sort of problems.

Netezza wins when there are a relatively small number of concurrent queries… and they can win when the queries are complex… as long as the zone map is in the plan.

Netezza can win when the POC is designed such that zone maps may be used in the POC… for example when the POC models only a single data load and the data is pre-sorted… even when the real application would fragment the data (for example… data will not naturally enter the warehouse sequentially by customer number… the same customer will be represented time and again… but if you load once only for a POC then you can sort by customer number and use it in the query predicates).

Note that I am not saying that Netezza is a poor performer when zone maps are not used… it is good… but they would never win a POC if no queries used the zone map.

Where They Lose

Guess what? Netezza loses when the zone maps cannot be used or can be used for only a small fraction of the query workload. Note again that the use of a zone map depends on two factors: the data has to be in sequence over all time, and the queries must use the columns mapped in the predicate. If data enters the system out of sequence then the zone map fragments and eventually loses the ability to speed up queries (a few random out of sequence rows are OK).

This constraint makes it hard for Netezza to service data warehouses where, by definition, lots of different user constituencies come at the data from lots of different directions… rather than always using the path grooved with a zone map.

Netezza was designed when only Sybase IQ had columnar oriented tables… today columnar is in nearly every DW database and this allowed the competition to cut deeply into Netezza’s competitive, zone-map enabled, edge. Teradata columns, Greenplum columns, or the natural column stores can win even when zone maps are on target.

Bottom line: do a POC…

In the Market

I spend most of my time in the general market for data warehousing. You won’t see me offer much of an opinion on HANA for BW, for example… even though there are ten thousand plus BW warehouses I just do not see them in the places I work.

Before Netezza was acquired by IBM they were everywhere… in nearly every POC. Now… not so much. To a very large extent they seem to have been directed into the Blue-only customer base (now that I think about it the same thing happened to the Ascential Data Stage suite of ETL products).

My Guess at the Future

As I noted in the reference above… I think that Netezza will eventually go away from the co-processor strategy.

There have been rumors for several years of design that allowed multiple zone maps. This would be very important… but loading out-of-sequence data, which is the necessary the result, could be very slow.

Netezza has lost some of its edge as other technologies added columnar capabilities to their technologies… and Netezza is surely looking at this… but their architecture which includes an execution engine on the server and on the FPGA makes this more complex than you might suspect. Zone maps and two-stage optimization (one in the server and once in the FPGA) is cool… but a tight coupling of the tricks makes for a difficult time extending and adding new features.

If I were the King of Netezza and I could not find a reasonable way to extend beyond the two tricks that got me here I would go with the flow… I would position Netezza as an extremely easy-to-deploy data mart appliance and hook it tightly (i.e. build in some integration) along-side DB2 and Hadoop… and I would cede the EDW space to DB2 and the Big Data space to Hadoop.

Next up… my 2 Cents on Greenplum

May 1, 2013: Here is an update, or maybe a summary, of my view on Netezza… – Rob

About these ads

23 thoughts on “My 2 Cents: Netezza 1Q2013

  1. Rob, Netezza has been able to provide the performance benefits of multiple zone maps per table, without restriction as to data type – and with equal benefit regardless of data ordering or arrival – through the use of clustered base tables for a couple of years now (since version 6, I believe). With v7, zone maps get even better as they apply to pages rather than extents now (i.e. 24X more granular = significantly less unnecessary I/O) which results in dramatic query performance improvements – even on existing hardware. All of which means almost this entire posting is, at best, misleading and inaccurate.

    Like

    • Hi Huw… could you send me references to these enhancements? What I described was, for the most part, the result of my experiences in competing with Netezza. I would love to correct any mistakes. In particular I would love to understand how you implement multiple zone maps when the data cannot possibly be sequenced for each map… and what the implications on load times are?

      I would agree that if both the constraints I describe are removed: only one zone map and a requirement to load data always in sequence… then my post is in error. But what I saw, within the last year, was POCs run by IBM where only a single zone map was implemented and where unsorted data broke the map.

      Like

  2. Sure – you want to read up on Clustered Base Tables (CBTs) that are created with an ‘ORGANIZE ON’ clause. Existing non-CBT tables can easily become so through use of the ALTER TABLE statement (and vice versa) without the need to reload, or resort the data with a CTAS type thing. You can organize on up to 4 separate columns at a time (which results in zone maps that provide an equal benefit for each), and the swiss-army-knife-esque multipurpose GROOM TABLE statement is what physically arranges the data on disk using a very clever (I’m reliably informed) multidimensional fractal clustering algorithm. GROOM is a normal housekeeping task that runs asynchronously and incrementally in the background (it beats as it sweeps as it cleans) without any impact on concurrency. Note that this also means a tables organization has no impact on load performance, but does mean recently loaded data may not experience the full benefit of that organization until the next incremental GROOM operation has completed.

    If IBM haven’t been using these in competitive benchmarks, then consider yourself both lucky and forewarned – I’m guessing it’s because they want to demonstrate ‘raw’ unadulterated performance and simplicity/time to market (i.e. basically to show off, without being accused of ‘cheating’). It could also be complacency, perhaps, as maybe they are used to winning without having to do much other than just load data and run queries.

    It could also be perhaps a sneaky game of waiting to see if a competitor gets even close, before dialling performance up a notch. Anyway, so now you know……

    Like

  3. Zone map is similar to partition/bitmap index in oracle. The key is still in the logical and physical design of the tables where we want to maximize the locality of the data per application access pattern. Zone map/partition will help to speed up the queries for beautifully laid out data so that the query response time is independent of the table size.

    CBT is helpful in some very specific use cases. There are costs associated with using CBT. No free lunch here.

    Like

  4. The FPGA and zone maps are not the only ‘tricks’ that got Netezza started…as cool as they are. It took a *lot* of work by Netezza engineers to get Postgres into shape so it could be used as parallel DBMS that could support the data volumes and query performance required.

    I suppose that’s more of a ‘caffeine and keyboard’ challenge rather than a ‘trick’ ;-)

    Whatever innovations/achievements we credit Netezza with, the mantra of ‘performance, value and simplicity’ was, and still is, a welcome addition to the market. Back in 2003 we timed it at under an hour from unpacking the shipping crate, plugging in power and ethernet, and loading data. This was simply unheard of at the time. Probably still is for most folks!

    I’m not convinced Netezza will always win in IBM shops. We’ve seem them lose where the prospect had outsourced all IT infrastructrure…to IBM. There are also those that won’t put Netezza on a short-list because of IBM’s ownership.

    Like Teradata’s bynet, the proprietary FPGA setup is a blocker to public cloud deployment. Should demand for such a deployment choice ever get to a critical mass we may just see a move towards Netezza being available on COTS servers ‘sans FPGA’.

    Like

    • @Paul, by a “trick” I just mean a unique feature… not found in all competitors. The effort to implement Postgres as a shared-nothing database was significant to be sure… but not “tricky”. My bad for using the term euphemistically…

      I am not as big a fan of easy when performance and simplicity requires a set of queries that fit the narrow groove defined by a zone map. In this case “simplicity” and “performance” are traded off for a large set of queries. Teradata and Greenplum are as simple if you just load data into columnar tables and go… and they will perform across a much wider range of queries.

      @John and @Huw, I have not had a chance to look at the details of the new CBTs… but as I understand it a zone map works because, as John points out, the “index” is tied to the physical layout of the data. Other index structures are just that… indexes… with the trade-offs associated. Everybody has indexes… I do not see CBTs as a differentiator for Netezza… and I do not see them as a thing “which results in zone maps…”?

      Like

  5. Heh. I never claimed CBTs were a differentiator or a ‘free lunch’. Nevertheless, they do directly address the incorrect assertions made in the article above as to data having to always be loaded in a particular order, and only a single zone map ever being useful for a table. Just because you do not know, understand, or cannot comprehend a thing does not make it untrue.

    Like

    • I stand by my position that zone maps are cool… But that the data has to be loaded in a particular order to use them.

      I now understand that there is an index, called a CBT… that is not a zone map, that can be built as well…

      Indexes are not any easier here than elsewhere… and they detract from the load data and go approach.

      Column store from others diminishes Netezza’s advantage. Indexes on Netezza is offset completely by indexes on columns…

      The bottom line… Netezza has lost ground in the last 2-4 years and will win fewer POCs as a result…

      Like

  6. A CBT is not an index. A CBT (clustered base table) effectively provides a multi-dimensional zonemap, essentially the ability to organize data by up to 4 columns, with the intention of providing equivalent select performance for each of the columns.

    As a side-benefit, if the columns are of a data type that is not normally eligible for a zonemap, e.g. character, then they WILL get a zonemap created for them, if they are defined as an organizing column.

    Every regular (i.e. non clustered table) has zonemaps automatically created on up to the first 200 columns in the table, this requires no design / thought by an administrator, it just happens automatically. If your data is sorted on a particular column then yes you get the perfect zonemap and the best I/O elimination, but if the data is not sorted a zonemap is still effective in reducing I/O, but just not perfect.

    In release 7.0, zonemaps are now more granular, down to individual 128KB pages rather than 3MB extents, that’s a potential 24x reduction in I/O in the best case, again without any design changes, it’s just available by default. Also in release 7.0, there is a new feature called Distribute Restict Optimization, which enhances the optimizer to be data distribution aware, i.e. only the data slices containing data the query needs will participate in the query, thus significantly increasing the number of concurrent queries that can be executed at a time, moving Netezza from 10′s or queries per second to 100′s of queries per second.

    When you then add into this the new generation of the Netezza appliance, announced in February, which provides a 3x performance uplift and a 50% storage uplift in the same rack footprint, Netezza has once again jumped to the top of the leader board as a result.

    Prior to 7.0 and the latest generation appliance, I would agree that Netezza had lost ground, but not anymore !!

    Your knowledge appears to be somewhat out of date, and even then incorrect, I suggest you get up to date and if possible some hands on with the latest and greatest appliance, so you can make a fair comparison and more accurately compare to the competition.

    Like

    • Sigh. If I Google Netezza CBT here are quotes from the first three entries:

      “A CBT is a user table that contains data which is organzed usng one to four orgainziing keys. CBT improve query performance by making more efficient use of zone maps for the table because the organizing key columns are also zone mapped (Remember your have to Groom the table after data load to make it effective).” http://www.enzeecommunity.com/message/11023

      Regarding the number of columns suported by a CBT: “The more columns you add to your organize on clause, the less effective the zone maps get for all of the columns.” http://www.enzeecommunity.com/thread/3587

      “The data in the CBT is physically organized into extents based on keys. And when we perform a groom operation, the data is physically reshuffled so that the keys are physically co-located on the fewest extents possible. Like a self-enforcing zone map. only a lot more powerful. In a zone map approach, the primary way to get the data contiguous is to re-sort the data and the zone maps fall out for free. In a CBT, we organize the data and the execute groom.” https://www.ibm.com/developerworks/mydeveloperworks/blogs/Netezza/entry/zone_maps_and_data_power20?lang=en

      Based on this it seems clear than a CBT is a structure that sits over a zonemap that provides keys into zones… an index by any other name… and GROOM is CREATE INDEX. As for as your claim that zonemaps are automatically created for up to the first 200 columns… this seems to contradict the notes above and I could find no reference to it. This link provides an overview of what is in 7.0 and includes a link to the release notes… I knew that zonemaps had become more granular… but highly granular indexes have their own issues… which is why hierarchical indexes like B-Trees are so widely deployed.

      Providing a 50% storage uplift is the result of adding bigger disk drives… and providing what you claim to be a 24X reduction in I/O but only a 3X performance boost (after adding upgraded hardware) seems odd?

      Harry, I’m not anti-Netezza… but comments without references that do not hold up when I try to do a fact-check will not convince me to post a retraction. Marketing CBT as something other than an index because it sits over a zonemap does not make it so. Marketing GROOM as something other than a DDL command that CREATES/UPDATES and index does not make it special. CBTs are a very nice feature for Netezza… I’m sure that they can improve performance… but your competition has similar capabilities called indexes.

      Rob

      Like

  7. First of all the second quote is completely incorrect, you have to recognise that those quotes come from an open community containing many newbie’s to Netezza, hence you should not take what is read as gospel !

    Your assumption that a CBT is a structure that sits over a zonemap is also incorrect. A CBT is not a separate structure, a “Clustered Base Table” is one that has it’s data physically organized on disk to group similar values together, to facilitate them residing in fewer disk extents, which leads to reduced I/O. Other databases also offer clustering capabilities, however, they often lead to data “swiss cheese” where storage requirements are increased due to only placing like values into an extent, Netezza’s implementation doesn’t suffer from this problem, hence there is no storage penalty from using them.

    Your statement that GROOM is “CREATE INDEX” is also incorrect, GROOM is the process that performs the physical clustering of the data.

    Zonemaps have been automatically created on the first 200 columns from when they were first introduced, this is nothing new, you will find that may things are not documented, because it is an appliance and you don’t need to concern yourself about such things. If you actually have access to a Netezza system, run the following command “nzsystem showregistry | grep zmapMaxCols” and you will see that the value of this system configuration parameter is 200, this is the maximum number of columns in a table that will have a zonemap created on them, i.e. the default is 200 as I stated previously, it can be changed if you wish, but I don’t know many people that have needed to.

    Regarding the 50% uplift, you are showing your ignorance again and the danger of commenting before undertaking the slightest research and due diligence in your comments, the uplift comes from using smaller rather than bigger disks, just more of them, the latest generation systems have 288 x 2.5 inch 600GB 10K rpm disk drives, in the previous generation it had 96 x 3.5 inch 1TB 7200 rpm disk drives.

    The 24x reduction in I/O is the best case example of the improvement that the page granular zonemap provides, obviously not every query would get this best case benefit.

    The 3x performance figure quoted is for a worst case improvement for a scan oriented query, i.e. not leveraging other recent performance enhancements such as page level zonemaps, cbt’s etc. If you look at the hardware improvements in the latest generation appliance, you have just over 3 times the scan performance, through improved disk I/O, improved FPGA and CPU power.

    So my point was, if you combine the 3x boost from the hardware uplift and the boosts from the recent software performance enhancements, many of which are automatically available without requiring any design/tuning effort, then you can get significant bang for your buck compared to the previous generation.

    Like

    • Harry,

      I will do you the courtesy of disagreeing with you without calling you ignorant.

      First… I did not state it correctly when I said that the 50% increase was due to bigger disk drives. My bad. In most hardware it hardly matters if you put in bigger drives or more drives with a larger total capacity as the bottle neck is at the disk controller… not at the device. Still, I stand corrected. I should have said that the storage uplift was due to adding more disk storage… and noted that this is pretty normal for any hardware upgrade from any vendor.

      Then, thanks for clarifying your 24X/3X claims. As I noted they were odd without the clarification.

      Regarding zone maps and CBTs. If you do a lookup in a separate data structure, in this case called a zone map, based on specified keys to identify which disk extents contain the key value then the conventional term for the separate data structure is “index”. If there is some other data structure that contains pointers to disk extents by key, included in a thing called a CBT… then the other data structure might fairly be called an “index”. If there is a reorganization utility that compresses records by key and builds/updates the data structure mentioned… then this may fairly be called a create index utility. Conversely, if there is no unique data structure and no key lookup logic that selects extents then there is no index. The link here to an IBM Developer Works page suggests that these structures and these lookups exist… and therefore I respectfully suggest that zone maps and CBTs are a form of index… and further I repeat that they surely provide a performance boost to queries running on Netezza.

      Rob

      Like

  8. Rob – I’ll apologise for the ignorant comment, I agree let’s keep it courteous, it was born out of my frustration in comments on open forums that are clearly incorrect and if not corrected other people then take as gospel (case in point being the #2 comment you referenced from the enzee community web site).

    Whilst a zonemap shares some characteristics of an index, e.g. it is a structure to improve the speed of data retrieval, it also differs in several ways too:

    1) The creation and maintenance of zonemaps is fully automatic. There are not many databases that I have come across that provide such a capability, usually indexes will require someone to design/create and often maintain them too.

    2) Zonemaps are insignificant in size and do not reduce the storage available for data. Indexes are usually significant in size and reduce the storage available for data.

    3) Zonemaps do not carry any significant overhead for write operations, Indexes usually carry a significant overhead for write operation.

    I guess I’m not going to change your view, that is your personal opinion. As David notes in the post you referenced, his view is that “an index is a surgical approach to finding data by logging its record-based keys and a record locator into a separately maintained structure. By finding keys, we find physical records” and hence he describes a zonemap as an anti-index, since it operates in completely the opposite way, identifying where not to look as opposed to where to look.

    I might concede that a groom is like a disk defrag or an index reorganisation, but it is not a create index utility, since it does not create a structure that does not already exist.

    I will definitely agree that zonemaps and CBT’s provide a performance boost to queries running on Netezza, and certainly in the case of zonemaps as already stated they are effectively free, requiring no design, no administration, no maintenance, do not detract from data storage etc. etc. they – I don’t see that as a bad thing ;-)

    Like

    • I agree, Harry, that zonemaps require no maintenance… and this makes them very cool and unique. Further I agree that zonemaps and CBTs index into ranges and not into records… and that this differentiation is important for data warehouse workloads. But the verb is still “index”. Saying that it tells you were not-to-look is squishy wording. A zone map uses the key to tell you which extents to read… it tells you where to look. “Anti-index” is marketing, not architecture.

      If the “ORGANIZE ON” clause builds a CBT data-structure that points to extents which is not populated until GROOM then I will agree that GROOM does not do everything that CREATE/UPDATE INDEX does… but it does the important bit.

      Keep me honest, Harry. I stay away from product features and try to keep focussed on bigger, architectural, issues partly because they are the main drivers of performance and functionality… and partly because there are too many features in too many products.

      BTW: you will hate my next post… ;)

      Rob

      Like

  9. Rob. We are looking for a used Netezza with the following specifications. Any ideas where I might be able to find one? Specs are as follows:
    - Mustang Full Rack
    - Model 10100
    - Release 4.0.4
    - 8 SPA, each containing 14 SPUs
    - 8 TB Storage
    - Hosting OS Linux

    Like

  10. You cannot buy a “used” Netezza appliance, the Netezza/IBM license agreement does not permit them to be re-sold.

    Like

    • Thank you. I talked to IBM and they said it would be OK as long as they were an existing customer providing a software license for another used system would be fine. Just need to find system for my customer. Any help would be great.

      David

      Like

Comments are closed.