How to create a data lake for fun and profit

Andrew C. Oliver

Most credit James Dixon of the open source BI vendor Pentaho with coining the phrase "data lake." Think of a data lake as an unstructured data warehouse, a place where you pull in all of your different sources into one large "pool" of data.

In contrast to a data mart, a data lake won't "wash" the data or try to structure it or limit the use cases. Sure, you should have some use cases in mind, but the architecture of a data lake is simple: a Hadoop Distributed File System (HDFS) with lots of directories and files on it.

Why would you want a data lake?
The answers are both technical and political. Usually, when you start up any new project that involves analyzing your company's data -- especially when the data is stored across functional areas -- you're in for trouble. For example, if the business unit that wants the data isn't part of the unit providing the data, what kind of priority do you think the unit providing the data will assign to the effort? How is it budgeted? Who does the integration and how much needs to be done? How do you structure the data and for what purposes?

Assuming you can sort all that out, when you're done, you have a system that can answer only a few preset questions. The next time you need more, you have a whole new project.

The data lake model turns all this on its head. Getting access to the data doesn't require an integration effort, because the data is already there. To start a new project, you merely request the appropriate role or group access (which in most corporate environments means changing Active Directory group assignments). It's all in the lake, and you can apply MapReduce, among other algorithms, to start crunching it.
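To make "apply MapReduce" a little more concrete, here is a minimal sketch of the map, shuffle, and reduce steps in plain Python. It is not Hadoop code and runs in a single process; the sample lines, the "region,product,units" layout, and the function names are all assumptions made up for illustration.

```python
from collections import defaultdict

# Hypothetical raw lines as they might sit in a file in the lake.
# Assumed layout: "region,product,units"
RAW_LINES = [
    "east,widget,3",
    "west,widget,5",
    "east,gadget,2",
    "east,widget,4",
]


def map_phase(line):
    """Emit one (key, value) pair per input line."""
    region, _product, units = line.split(",")
    yield region, int(units)


def shuffle(pairs):
    """Group every emitted value under its key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse the values for one key into a single result."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle, and reduce in order over the input lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


units_by_region = run_job(RAW_LINES)
print(units_by_region)
```

On a real cluster the framework handles the shuffle and spreads the map and reduce work across machines; the shape of the job stays the same.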

Unstructured? Really?
Well, that may be a bit overstated. It isn't that all the data is unstructured, more that we won't perfect a schema as a BDUF (big design up front). You don't know all of the use cases for your data, so how can you know the perfect structure?

Some data is unstructured or not structured by us for a given project, but much of it comes from source systems that structure it differently than we need. A better term for how we store data in the lake is "schema on read" rather than the traditional "schema on write" (or, in some companies, "schema designed months before your first write"). We'll structure the data to the questions rather than attempting to structure the questions to the problems.
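A small sketch may help show what "schema on read" means in practice. The raw text below stays exactly as a source system might have written it, and each question brings its own structure at read time. The CSV layout, the column names, and both reader functions are hypothetical, chosen only for illustration.

```python
import csv
import io

# Hypothetical raw export, kept in the lake exactly as the source system wrote it.
RAW_ORDERS = (
    "order_id,customer,order_date,amount\n"
    "1001,acme,2014-03-01,250.00\n"
    "1002,globex,2014-03-01,99.50\n"
    "1003,acme,2014-03-02,40.25\n"
)


def spend_by_customer(raw_text):
    """Question 1: read only customer and amount, typed for summing."""
    totals = {}
    for row in csv.DictReader(io.StringIO(raw_text)):
        customer = row["customer"]
        totals[customer] = totals.get(customer, 0.0) + float(row["amount"])
    return totals


def orders_per_day(raw_text):
    """Question 2: read only the date column from the same raw text."""
    counts = {}
    for row in csv.DictReader(io.StringIO(raw_text)):
        day = row["order_date"]
        counts[day] = counts.get(day, 0) + 1
    return counts


print(spend_by_customer(RAW_ORDERS))
print(orders_per_day(RAW_ORDERS))
```

Neither reader changes the stored text, so a third question next month can impose a third structure without a new integration project.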

How you go about constructing a lake
Remember how we talked about not planning for all use cases? Well, that's true, but it's hard to construct a lake without thinking about any use cases. You should have some in mind. Some may be existing ones, but generally, there is always something that your company wanted to do but couldn't get the data together to execute on. Sometimes you pick obvious, albeit theoretical, cases based on your knowledge of the systems you have, the data they contain, and the possibilities for that data.
