The Big Data Devil

Devil (Photo credit: Wikipedia)

I just finished a draft for next week on Big Data and thought that with this note I might form a preface…

First… Big Data is about, well…, Big Data. When Gartner devised the three V's, I suspect they were trying to frame the new stuff that was emerging, not establish a concise definition. So let me be very clear about what I think Big Data is and is not.

Big Data is about volume: not velocity, not variety. That is what the words "big" and "data", conjoined, must mean. Velocity + Volume is Big Data. Variety + Volume is Big Data. By themselves, Velocity and Variety are new, important, separate technological trends.

Next, Big Data is a new thing. It is not a technology that was around in a meaningful way five-plus years ago; it was just emerging then, so we should see evidence of it in the advances offered by Web Scale companies like Google, Yahoo, and Netflix. It is not data that was conventionally created, captured, or used before 2010 or so.

So what is new, big, and was emerging in recent history? It is the creation, capture, and use of machine-generated data: click-stream data, system log data, and sensor data. Big Data technology has to do with the creation, capture, and utilization of large volumes of machine-generated data… nothing more or less.

Rob: Big Data legitimately includes Social Data as well, as Vitaliy rightly commented… I'll post on this soon…

Machines generate data at a very low level of detail. It is said that the devil is in the details… and the next post deals with the notion that, in order to make our companies more profitable, we must all chase this damnable devil.

PS

I wonder if damnable devil is redundant? Probably, yes.

2nd PS (sort of like 2nd breakfast)

Big Data is not about any and every new technology introduced in the last five years…


7 thoughts on "The Big Data Devil"

  1. I agree about the huge piece of machine-generated data in the Big Data pie. But what about human-generated social data (the classic examples of Facebook and Twitter)?

  2. I’m sure Rob will have an interesting take on machine vs. social data. Mine is that for those companies the actual social data is not as large a computing problem as the machine data and systems behind each element of social data. For each Facebook post there are hundreds of other interactions required to support that single transaction, and solving that problem will naturally solve the computing aspect of the social content itself. Sure, there are plenty of complex things to do with the social data itself – affinity and text analysis, to name the obvious – but from a computational aspect, just dealing with the sessionization is a much bigger problem.
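The sessionization problem mentioned in this comment can be sketched in a few lines. This is a minimal, hypothetical illustration assuming the conventional 30-minute inactivity timeout; the `sessionize` function, user names, and timestamps are all made up for the example, not anything from the discussion:

```python
from datetime import datetime, timedelta

# Illustrative assumption: a session ends after 30 minutes of inactivity.
SESSION_TIMEOUT = timedelta(minutes=30)

def sessionize(events):
    """events: list of (user_id, iso_timestamp) tuples.
    Returns (user_id, iso_timestamp, session_id) tuples, where the
    per-user session counter increments after a gap > SESSION_TIMEOUT."""
    out = []
    last_seen = {}   # user_id -> time of that user's previous event
    session_no = {}  # user_id -> current session counter
    for user, ts in sorted(events, key=lambda e: (e[0], e[1])):
        t = datetime.fromisoformat(ts)
        # Start a new session on the first event or after a long gap.
        if user not in last_seen or t - last_seen[user] > SESSION_TIMEOUT:
            session_no[user] = session_no.get(user, 0) + 1
        last_seen[user] = t
        out.append((user, ts, f"{user}-{session_no[user]}"))
    return out

clicks = [
    ("alice", "2013-01-01T10:00:00"),
    ("alice", "2013-01-01T10:10:00"),  # 10-minute gap: same session
    ("alice", "2013-01-01T11:00:00"),  # 50-minute gap: new session
    ("bob",   "2013-01-01T10:05:00"),
]
print(sessionize(clicks))
```

Even this toy version hints at the commenter's point: sessionization requires ordering and grouping every raw event per user, so the machine-generated event stream, not the visible social content, dominates the computation.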

  3. To be fair, social data probably is legitimately Big Data. It is different, for sure. I have an outline on the topic I’ll try to clean up and publish for your consideration ASAP… and we’ll take up this thread there.

  4. Rob, as ever a nice, down-to-earth way of looking at this.
    The other day in a meeting it struck me that, not so many years ago, there was very little data about me in existence – a few k at most. That data would have been keyed in by dedicated employees of a bank, an insurer, the Driving Licence Centre, or the Passport Office, and would all have been high-level and descriptive – name, address, date of birth, etc.
    Now, however, without any effort I leave a rich trail of data behind me. There is the data I type in myself – orders on Amazon and eBay, cinema tickets, utility meter readings – which requires no third party to input for me. I also lay a trail of machine-generated data from my two smartphones (work and personal), both for the calls I make and for my location (which Google also tracks for me), plus my subscription to TomTom's traffic service, which provides two-way communication on my whereabouts. This is in addition to services I use such as air travel, which track my movements around Europe and the rest of the world.
    I guess the point here is that whilst a few years ago I didn't generate data, nowadays I not only do so, but I do it with ease – and without really realising it.
    At the same time it has become, in relative terms, both easy and cheap to capture and process that data. Therefore, as you point out, we're heading into genuinely new territory – where our ability to generate, capture, and process data, without expending great effort, has suddenly expanded by several orders of magnitude.
    I liked the quote – I think it was by Neil Raden – which went: "Have you heard the new name for Big Data? It's 'Data'." We've always been interested in capturing and analysing data; it's just that now it has become very easy. This shouldn't surprise us, as we've seen a steady move towards digital technology for human interactions, human-machine interactions, and machine-to-machine interactions. As this becomes ubiquitous, the spin-off of large volumes of digital data that is easy to capture is simply a side effect.
    It reminds me of the book 'Being Digital' by Nicholas Negroponte, which talks about the effects of digitization ( http://www.amazon.com/Being-Digital-Nicholas-Negroponte/dp/0679762906 ). He was more concerned with how digital technology would transform our day-to-day experience and interactions as things 'turn from being composed of atoms to being composed of bits'. But of course, once things like music, travel tickets, diaries, and telephones stop being physical objects and become simply different types of software or digital patterns, taking copies of their digital content as they are used is pretty trivial, and the economics of high-volume storage and processing mean we're in a position to make use of it – once we figure out what is important and how to do that.
    … anyhow, just a thought, and a good discussion to kick off.

  5. I am not sure I agree with your definition. Companies were processing huge amounts of data before, but using a limited number of sources. The difference with “big data” is, first and foremost, the capabilities that organizations are building around the technologies tagged as “big data” technologies. For example, could companies, using their traditional BI setup, answer questions not supported by the schema before? No. Can they do it now? Yes.

    This is, I feel, the distinction between the old and new eras: capabilities, and, around those capabilities, the technologies and organizational structures (e.g. a Chief Knowledge Officer, a big data department) that support them.

    • Hi Andreas,

      An interesting point, for sure… But I’m not sure what the ‘new’ data is that could not be processed before and does not come in large volumes. We all processed what is now called semi-structured data back in the mainframe days; text went into BLOBs. So I’m not sure what it really is that we could not process before (in our BI systems) that we can now, that is not big. Could you provide some examples?

      Thanks,

      Rob

      • Hi Rob,
        Thank you both for taking the time to answer my comment and for the great articles you’re publishing on your blog; it’s a great resource for people like me who are trying to understand more about the market.
        In my first response I was not saying that companies weren’t able to process other kinds of data, only that back then they didn’t. The term I am using, “capabilities”, has a wider meaning than “technologies” or “tools” in the corporate setting. Capabilities in an organization mean not only being able to do something, but being able to do it at low cost (which matters hugely when it is difficult to establish a persuasive argument about the ROI of a particular project), or being able to do it in a way that is more meaningful to the business or easier to implement (i.e. quicker, more flexible, and without creating a lot of overhead). Hadoop now gives us most of those capabilities.
        I’ll give you another example of my rationale. If you pay close attention to the retail industry, a lot of people (including CEOs of large retail chains) are talking about RFID at the product level as something that is going to happen sooner or later. Do we have the technology to add an RFID tag to each of the cans that Wal-Mart is selling? Definitely yes. Is it meaningful at the current economics? No. Once we have the proper ‘capabilities’, be they new materials or new economies of scale for RFID production, we will be ready to deploy such solutions.
        Best,
        Andreas.
