How to Avoid Drowning in a Data Lake
So what exactly is a data lake? And how does one avoid drowning in it?
An excellent question, and one that on the face of it is relatively easy to answer. According to Gartner:
“The growing hype surrounding data lakes is causing substantial confusion in the information management space, according to Gartner, Inc. Several vendors are marketing data lakes as an essential component to capitalize on Big Data opportunities, but there is little alignment between vendors about what comprises a data lake, or how to get value from it.”*
Gartner sums it up quite well. A data lake is a marketing spin on Big Data. It is another term for basically the same thing (i.e. lots of data), although those talking about lakes would say that they are more contained and easier to manage than Big Data. I suppose this could have some element of truth, if it were not for the dollar sign that has just been placed before the term! To sum up: A data lake is another Big Data term, but generally used for a smaller chunk of Big Data. But still a lot of it! So data lakes really are Big Data. I hope that has not confused the issue!
Now on to the more pressing issues around Big Data and data lakes. How do you manage them so that you are not overwhelmed with the amount of material at hand?
Over the last two years there has been a lot of hype around Big Data use. Hype that says all of this data can allow you to grow market share, help you build more responsive systems, or even help predict the rate of flu virus spread. But this has been found to be true in only a very few cases.
Generally the hype around all of this data has been just that: hype. The promised land of sales growth or better customer experience has not materialized. Instead, we have a large number of organizations paying a river of cash to a smaller subset of suppliers for statistics from data that is in essence useless, because it is in the form that the statistical model dictates and has additional data points added. This is not to say that the data itself is useless. Rather, the way we have been using it so far has little or no practical use to the purchaser because of its format.
Remember the old adage, “You can make statistics mean anything you want?” Well, add millions or even billions of data points, then create your statistics, and they will say whatever you like. Set your proof points to be ‘x’ and you will get enough of them to prove it. Whatever pattern you are seeking will appear, because the data universe is so large. This is when you have truly drowned.
I believe the best way to manage and use this data is to break it into bite-sized chunks. Let us leave lakes well behind and start to talk “puddles.” A data puddle is easy to manage. When was the last time you heard of someone drowning in a puddle?
You have massive amounts of data – yes. But your customers are individuals.
The true power of your data lies in the small puddles about individual customers or users. Those pieces of data that can say, “Last week Jim used 140% of his allocated resources,” and let us ask, “Why did Jim’s use spike? Does he need an upgrade? Or is it a one-off?”
Now we can go back and look at some historical data on Jim and build a smaller set of information about Jim, and maybe about those who are directly associated with him. Once we have this data, we can start to see whether there is a pattern, and then we can start to harness the information that we have.
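The puddle approach above can be sketched in a few lines of code. This is only an illustration, assuming a hypothetical record of Jim's weekly resource usage and an arbitrary 130% spike threshold; the function and data names are invented for the example, not part of any product described here.

```python
# A minimal sketch of the "data puddle" idea: instead of mining the whole
# lake, pull one customer's history and check it for a spike.
# The data, the function name, and the 1.3x threshold are all illustrative.
from statistics import mean

def find_spikes(weekly_usage, threshold=1.3):
    """Return the weeks where usage exceeds `threshold` times the
    average of all the other weeks (the customer's normal baseline)."""
    spikes = []
    for week, used in weekly_usage.items():
        others = [u for w, u in weekly_usage.items() if w != week]
        baseline = mean(others)  # what "normal" looks like without this week
        if used > threshold * baseline:
            spikes.append(week)
    return spikes

# Jim's puddle: four weeks of usage, as a fraction of his allocation.
jim = {"w1": 0.95, "w2": 1.02, "w3": 0.98, "w4": 1.40}
print(find_spikes(jim))  # → ['w4']
```

The point is not the arithmetic but the scale: a per-customer puddle like this fits in memory, the baseline is meaningful because it belongs to one individual, and the spike prompts a concrete question (“does Jim need an upgrade?”) rather than a pattern imposed on millions of unrelated data points.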
Instead of imposing patterns that may not exist, it makes more sense to extrapolate from smaller patterns to larger trends. By analyzing smaller, more relevant sets of data, we’re putting our data to better use and using it to increase our bottom line.
So to sum up: I believe the best way not to drown in a data lake is not to have one. We have masses of data at hand, data that is really important; data that is Big and that takes up terabytes of storage. To truly harness this data and not drown in it, we need to start from the basics of what we are looking for and seek it in smaller, simpler sets of data, not take it all and try to fit it to what we or someone else wants us to see.