When I was in college, my housemate Craig* justified his lack of tidiness with a theory he called the "One Pile Method." In practice, this involved dumping all of his clothes, books, homework, sports equipment, and anything else he happened to be carrying right in the middle of his room upon entry. The argument was that anytime he needed anything, he knew right where to look: it had to be somewhere in that one pile. This was claimed to be highly efficient in terms of time and effort.
I had a girlfriend, Gladys**, who was lovely but perhaps a touch materialistic, too. Once, when told, "You can't have everything; where would you even put it?" she immediately came back with, "I'll rent storage." I remain friends with both individuals to this day, though somehow I've never lived with either one since graduation.
Many practitioners of big data seem to be adherents of these general concepts, and now Hadoop enables them to try the same approach in their businesses. They capture all data coming in the door and drop it right into a heap of storage. The scalability and cost advantages compared with traditional databases or data warehouses make this not only possible, but also actually attractive. Hoarding data is a good thing, right?
Well, only if you can manage that ever-expanding pile, now sometimes called an "enterprise data lake." Some of the top considerations include:
- How much does all that storage end up costing you?
- Can you really find what you need, when you need it?
- How long does it take you to find and process the right stuff?
- There must be some private stuff in there; who else can get at it?
- How hard is it to build useful applications on the pile? OK, so the metaphor has officially broken at this point, but you get the idea.
The problem with Hadoop is that it's easy to get started with, but it eventually becomes somewhat hard to monetize, control, and generally operate according to enterprise requirements. I've been arguing lately that the new sexiest job in IT isn't the data scientist; it's the data steward.
EMC and its federation sister companies (Pivotal, VMware, and RSA) have now worked together to make the one pile, ahem, business data lake something that actually works for the business. Their efforts are worthy. They make big data not just feasible, but something you actually want to live with.
You can read more about the solutions at EMC.com.
* Name unchanged—to embarrass the guy.
** Name changed—to protect me from her wrath.