Is it really big data we are looking at?
10/23/2014 by Tricia Aanderud
Big data has been a popular term in the data world for the past few years. At the last conference I attended, one person whispered he didn’t know what the term even meant, while another group said they had been working with it for years. Still, you need a working definition. I mean, is it like pornography, as in I know it when I see it? Or is it more relative, meaning that if all I have is a 2 GB thumb drive, then anything over that is big data to me? And there’s still the possibility that it’s all marketing hype.
Let’s start by defining big data and then you can decide for yourself if it’s all hype. When defining big data, many analysts default to Gartner Research’s 3 Vs definition, which refers to data volume, velocity, and variety. This definition makes several points about big data, the main one being that it is about more than size. If an organization is receiving unstructured data faster than it can load it into a database, and the data is still growing, a data boundary has been crossed. Examples often given are web logs and social media streams (Twitter, Facebook). Diya Soubra had a good discussion about the 3 Vs at Data Science Central. He created the following graphic, which I think helps explain the concept.
Gartner has since expanded the 3 Vs to 11 different points. Geez … even the big data definition is growing! SAS Institute also added variability and complexity to the 3 Vs … so I guess that makes 4 Vs and a C?
Most of this big data is unstructured or free-form (think videos, social media streams, ZIP files). However, it can also be semi-structured (think XML). This basically means that the data is not modeled, or that it’s not already in a relational database (RDBMS). Databases generally contain transactional data (think records, credit card purchases, trouble tickets) where each field has a preset format and type. Consider how much easier that data is to analyze than the free-form kind.
Here’s a fun, short Intel video clip to provide more insight about big data. It does make you wonder what business value there could be in this data.
In my past life working in a customer service department, one of my jobs was to run a customer satisfaction survey. It was web-based and I would get some good comments, but I always wondered if I was asking the right questions. Sometimes I would listen to the customer calls to see if I could spot patterns of discontent. It would have been awesome to have the calls transcribed and then turned into a big tag cloud of common words. This is an example of how companies are using big data. TATA’s big data study revealed that 42% of big data was used for customer-facing activities such as Sales, Marketing, and Customer Service. The data is used to better segment and serve customers. There are other examples of big data usage here.
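The tag cloud idea from the customer service story comes down to counting words across transcripts. Here is a minimal Python sketch of that counting step, not a real text-analytics pipeline. The function name `top_words`, the short stop-word list, and the sample transcripts are all made up for illustration.

```python
import re
from collections import Counter

# Hypothetical, deliberately short stop-word list (an assumption for this sketch).
STOP_WORDS = {"the", "a", "an", "and", "i", "my", "is", "was",
              "to", "it", "for", "on", "of"}

def top_words(transcripts, n=5):
    """Return the n most frequent non-stop words across all transcripts."""
    counts = Counter()
    for text in transcripts:
        # Lowercase the text and pull out runs of letters/apostrophes as words.
        words = re.findall(r"[a-z']+", text.lower())
        counts.update(w for w in words if w not in STOP_WORDS)
    return counts.most_common(n)

# Invented call transcripts, standing in for real transcribed calls.
calls = [
    "My invoice was wrong and the refund is late",
    "I waited on hold for an hour about my refund",
    "The refund never arrived and the invoice is still wrong",
]

print(top_words(calls, 3))
# → [('refund', 3), ('invoice', 2), ('wrong', 2)]
```

A tag cloud would simply draw each word at a size proportional to its count. The hard part with real call data is the transcription and the sheer volume, not the counting.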
In an Informs Science of Better podcast called Hadoop Anyone?, Brian Keller of Booz Allen Hamilton defined big data as “when hardware cannot solve the problem.” It’s a good definition because buying hardware is the current, traditional approach … just get a bigger machine. Why not just use a bigger hard drive, or get one of those fancy IBM 120 PB hard drives? Let’s see, $0.05 per MB * 120 PB minus a good customer discount … that seems cheap enough. A hardware solution may be more likely for someone using an RDBMS to manage the data.
Keller’s point is that databases scale vertically, meaning that at some point you cannot make the machine any larger to solve your data storage and access issues. In other words, big data is data that exceeds traditional processing capacity. This is why many are turning to Hadoop: the data is distributed across several machines, allowing you to scale horizontally. If you need more space, just add another node. It’s not so easy with an RDBMS. [Paul Kent talks about how SAS works with Hadoop and gives customer examples.]
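For anyone curious what that tongue-in-cheek 120 PB price tag works out to, here is a quick back-of-the-envelope calculation in Python. It assumes decimal units (1 PB = 10^9 MB) and takes the $0.05 per MB figure at face value. It is the joke number from the paragraph above, not a real quote. The function name `storage_cost_dollars` is made up for this sketch.

```python
MB_PER_PB = 10**9  # decimal units: 1 PB = 1,000,000,000 MB (an assumption)

def storage_cost_dollars(petabytes, cents_per_mb):
    """Whole-dollar cost of storage, computed in integer cents to avoid float rounding."""
    total_cents = petabytes * MB_PER_PB * cents_per_mb
    return total_cents // 100

cost = storage_cost_dollars(120, 5)  # 120 PB at 5 cents per MB
print(f"${cost:,}")
# → $6,000,000,000
```

So "cheap enough" is about six billion dollars before the good customer discount, which is rather the point of Keller's definition.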
When it comes down to it, the main question is really this: when should I consider an alternate technology? In a recent blog post, Hadoop developer Chris Stucchio advised against even thinking about installing Hadoop until your data exceeds 4 terabytes. His opinion is that Hadoop is still in its infancy and may not be worth the hassle. Rakesh Rao of Quaero offered five clues that it is time to change, one being a funny observation about altering the Table of Doom. He was referring to the one table in the database that is so large and important that making changes to it is a time-consuming nightmare. Going by the 3 Vs definition, once you need to start adding massive amounts of unstructured data to your mix, your organization will need a way to work with it.