How to create a data lake for fun and profit

Andrew C. Oliver

While Hadoop scales well, a view of any chart showing how Hive works with joins should give you pause. You may ask: How might I flatten this? For example, take the traditional example of Orders, Order_Items, and Product tables. Anything you do with this data -- except for summarizing orders, which isn't a likely case for analytics -- will join these tables. Why not join them in advance into one flat file?
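To make the idea concrete, here is a minimal Python sketch of joining Orders, Order_Items, and Product once, up front, into one wide row per order item. The sample rows, column names, and the `flatten` helper are all hypothetical, made up for illustration. On a real cluster you would do this as a one-time job in Hive or a similar tool rather than in memory.

```python
# Hypothetical sample tables; names and columns are illustrative only.
orders = [
    {"order_id": 1, "customer": "alice", "order_date": "2014-07-01"},
    {"order_id": 2, "customer": "bob", "order_date": "2014-07-02"},
]
order_items = [
    {"order_id": 1, "product_id": 10, "quantity": 2},
    {"order_id": 1, "product_id": 11, "quantity": 1},
    {"order_id": 2, "product_id": 10, "quantity": 5},
]
products = [
    {"product_id": 10, "name": "widget", "unit_price": 2.5},
    {"product_id": 11, "name": "gadget", "unit_price": 10.0},
]


def flatten(orders, order_items, products):
    """Join the three tables once, producing one wide row per order item."""
    # Index the two lookup tables by key so each item is resolved directly.
    orders_by_id = {o["order_id"]: o for o in orders}
    products_by_id = {p["product_id"]: p for p in products}
    flat = []
    for item in order_items:
        order = orders_by_id[item["order_id"]]
        product = products_by_id[item["product_id"]]
        # Merge the order, item, and product columns into a single record.
        flat.append({**order, **item, **product})
    return flat


flat_rows = flatten(orders, order_items, products)
for row in flat_rows:
    print(row)
```

Every downstream query can then read `flat_rows` (or the file it is written to) without repeating the join. The price is that order and product columns repeat on every item row.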

Even if you do summarize orders, filtering duplicate rows out of one flat file is generally more efficient than joining many tables, at least up to a point. And if summarizing were important, there is no reason not to keep two views of the data. I mean, what are you doing -- saving disk space? Cheap storage is part of the magic of Hadoop.
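As a rough illustration of summarizing by filtering duplicates, the Python sketch below recovers one row per order from an already flattened data set and then counts orders per customer. The rows, column names, and the `distinct_orders` helper are assumptions invented for this example, not part of any Hadoop or Hive API.

```python
from collections import Counter

# Hypothetical flattened rows: order columns repeat once per order item.
flat_rows = [
    {"order_id": 1, "customer": "alice", "product": "widget", "quantity": 2},
    {"order_id": 1, "customer": "alice", "product": "gadget", "quantity": 1},
    {"order_id": 2, "customer": "bob", "product": "widget", "quantity": 5},
]


def distinct_orders(rows):
    """Keep the order-level columns and drop the repeats, with no join."""
    seen = set()
    result = []
    for row in rows:
        key = (row["order_id"], row["customer"])
        if key not in seen:
            seen.add(key)
            result.append({"order_id": row["order_id"], "customer": row["customer"]})
    return result


# Summarize at the order level: how many orders did each customer place?
orders_per_customer = Counter(o["customer"] for o in distinct_orders(flat_rows))
print(orders_per_customer)
```

This is a single pass over one file. If order-level summaries turn out to be a frequent query, storing the output of `distinct_orders` as a second view trades a little disk space for that pass.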

Next, expand
Once people start using the data lake and every BI project starts with Hadoop, you can expand your capabilities, adding more external tools and demonstrating techniques like machine learning and pattern finding with Mahout. Maybe you start streaming data in "real time" and adding more processing power with Spark. Maybe you materialize common views in HBase. But don't get derailed along the way. Lake security may have business unit implications, but you shouldn't end up with a lot of mini lakes (aka data ponds) that are separate and not equal.

If all this still seems a bit confusing, here's the quick and easy version:

  1. Identify a few use cases
  2. Build the lake
  3. Get data from lots of different sources into the lake
  4. Provide a variety of fishing poles and see who lands the biggest and best trout (or generates the most interesting data-backed factoid)

Granted, the more technical analysts will eventually do much of the work, and there is always a risk of misinterpretation. But getting the data into the hands of the people and letting them play with it is good for your lake and your business.

Source: InfoWorld
