How to create a data lake for fun and profit

Andrew C. Oliver

You'll need to learn some of the Hadoop stack -- such as Sqoop, Oozie, and Flume -- and obtain feeds from your existing systems. Getting this process under way is the bulk of the grunt work; the rest ends up being more of an intellectual exercise.

Next, find a unicorn (aka data scientist), shoot the unicorn in the head because it's probably a shyster anyhow, and drink its blood Voldemort-style. Actually, you won't have to do that, because data scientists do not exist. Data scientists supposedly know advanced mathematics, artificial intelligence, and computer science, and they understand Hadoop -- as well as business and your business data in particular. In addition, they walk on water, bake gluten-free vegan bread that doesn't taste like sawdust, conjure good spirits, and sell you timeshares cheap.

In reality, the people you need are the people you always need: technically adept facilitators who pull the right people with the right knowledge into a room and work through problems. There is no unicorn; we are all the unicorn together.

Start with basic cases and use simple and familiar tools like Tableau (which can connect to Hive) to make nice charts, graphics, and reports demonstrating that, yes, you can do something useful with the data. Bring more stakeholders to the table and generate new ideas for how you can use the data. Advertise the system and its capabilities throughout the organization.

Consider security up front, as well as who can access what data. This will inform the structure of your directories and file locations on HDFS. Deploy Knox to enforce it, because by default HDFS trusts the client the same way that NFS does. The idealist says: "Oh, you have a project, go to the data lake." The realist says: "Oh, you have a project, get the right permissions in your data lake." At least you're not faced with a big, fat project where you need to provision a VM, get a feed from the relevant systems, create a schema to hold the data, and on and on.
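
To make "who can access what" concrete, here is a minimal sketch of planning the directory layout before you create anything. The paths, owner, and group names are hypothetical, invented purely for illustration. The script only prints the standard `hdfs dfs` commands (`-mkdir -p`, `-chown`, `-chmod`) rather than running them, so you can review the plan first.

```python
# Hypothetical layout: each area of the lake maps to (owner, group, mode).
# None of these paths or group names come from a real cluster.
LAYOUT = {
    "/data/lake/finance": ("etl", "finance", "750"),
    "/data/lake/marketing": ("etl", "marketing", "750"),
    "/data/lake/shared": ("etl", "analysts", "755"),
}


def hdfs_commands(layout):
    """Build the shell commands that would create and lock down each directory."""
    cmds = []
    for path, (owner, group, mode) in sorted(layout.items()):
        cmds.append(f"hdfs dfs -mkdir -p {path}")
        cmds.append(f"hdfs dfs -chown {owner}:{group} {path}")
        cmds.append(f"hdfs dfs -chmod {mode} {path}")
    return cmds


commands = hdfs_commands(LAYOUT)
for cmd in commands:
    print(cmd)
```

Permissions like these only help if the client's claimed identity can be trusted, which is the gap the gateway recommendation above is meant to close.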

Start with the core Hadoop platform. Don't get fancy at first. Don't launch a massive AI project that replaces the whole organization with your pipe dream of creating Skynet à la Hadoop. Start with bringing the data analytics to the people and making the data more accessible to them. Find a way to let people go fishing in the lake for what they want.

About relational data
Realistically, you can't dump everything in the data lake without messing with it first. As you work through your use cases, you may find the need to flatten some of your data, especially if it came from a relational source.
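
As a rough illustration of what "flatten" can mean, the sketch below denormalizes two hypothetical relational tables (customers and orders, both invented for this example) into one wide record per order. In practice you would more likely do this in Hive or during ingest; plain Python is used here only to show the shape of the transformation.

```python
# Hypothetical rows as they might arrive from two relational tables.
customers = [
    {"customer_id": 1, "name": "Acme", "region": "East"},
    {"customer_id": 2, "name": "Globex", "region": "West"},
]
orders = [
    {"order_id": 100, "customer_id": 1, "total": 250.0},
    {"order_id": 101, "customer_id": 2, "total": 75.5},
    {"order_id": 102, "customer_id": 1, "total": 12.0},
]


def flatten(orders, customers):
    """Copy each customer's columns onto its order rows (a denormalizing join)."""
    by_id = {c["customer_id"]: c for c in customers}
    flat = []
    for order in orders:
        customer = by_id.get(order["customer_id"], {})
        row = dict(order)  # copy so the source rows stay untouched
        row["customer_name"] = customer.get("name")
        row["customer_region"] = customer.get("region")
        flat.append(row)
    return flat


flat_rows = flatten(orders, customers)
for row in flat_rows:
    print(row)
```

Each flat row now answers "which customer, which region, how much" without a join, at the cost of repeating the customer columns on every order.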
