Gartner gets the 'data lake' concept all wrong

Andrew C. Oliver

I didn't want to write about "data lakes" or big data projects this week because I wrote about them for the last two weeks. Then Gartner released this nonsense.

We all knew and loved Gartner in the '90s as Microsoft's favorite analyst firm, before it bit the hand that fed it somewhere in the early 2000s. I don't know exactly when Gartner regained its respectability, but its latest diatribe (I suggest you thumb through the summary rather than enduring the Alan Greenspan-like Gartnerese of backtracks and doublespeak) attacks the concept of a data lake without offering any credible alternative. Instead, Gartner suggests you try even harder with data warehousing.

This is tried and true advice that has worked so well throughout human history: Be extra careful, plan really hard, coordinate well with large groups of people, and don't mess up. This great plan was brought to us by the buffer overflow, buffer underrun, privilege escalation, and the fine people from the White Star Olympic line of luxurious sea vessels, because one out of three ain't bad.

The data lake strategy is part of a greater movement toward data liberalization. It started with the printing press and moving the books out of the monastery. Sure, there was confusion and a schism, but did we really want to wait for the monks to decide who gets the handwritten books?

It continues with the Internet. Granted, it's sad that bookstores are toast, but I really hate to wait in line. Yes, Wikipedia has its problems, but in comparison, Encyclopedia Britannica (now on disc) delivers only slightly less erroneous material -- and one-tenth the coverage.

Now Gartner has aligned itself with the data monks who sit over the data and hoard it in usually expensive, proprietary technologies. It may be more secure (don't bet on it), and if only those trained (or who have sufficient clout) can access it, then the interpretation may be more accurate -- or the distortions more deliberate.

By that same argument, proprietary software is more secure because only "experts" have access to the source, right?

Gartner critiques vendors' marketing of the data lake concept, along with the intuitive meaning of the name, rather than basing its analysis on how data lakes are actually implemented in practice. Of course you can drown in a data lake! But that's why you build safety nets: security procedures (for example, access is allowed only via Knox), documentation (what goes where in which directory and which roles you need to find it), and (yes, Gartner) governance.
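That documentation-plus-governance safety net can be surprisingly lightweight. A minimal sketch of the idea -- a declared map of lake zones to the roles allowed into them, consulted before any access -- using hypothetical zone names and roles, not any real deployment:

```python
# Hypothetical data lake layout: each zone documents what lives there
# and which roles may read it. Undocumented paths are denied by default.
LAKE_ZONES = {
    "/lake/raw": {"data-engineer"},                          # untouched source dumps
    "/lake/curated": {"data-engineer", "analyst"},           # cleaned, documented data
    "/lake/sandbox": {"data-engineer", "analyst", "scientist"},  # experiments
}

def can_access(path: str, user_roles: set) -> bool:
    """Allow access only if the user holds a role listed for the zone."""
    for zone, allowed_roles in LAKE_ZONES.items():
        if path == zone or path.startswith(zone + "/"):
            return bool(user_roles & allowed_roles)
    return False  # path not documented anywhere: deny

print(can_access("/lake/raw/events.json", {"analyst"}))        # denied
print(can_access("/lake/curated/sales.parquet", {"analyst"}))  # allowed
```

The point is not this particular code but the shape of the control: the rules live in one documented place that anyone can read, rather than in a gatekeeper's head.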

But this needn't involve convening a massive integration project every time someone wants to pull data out in a way that hadn't been thought of before or draw new correlations between data from disparate systems. Sure, people will make mistakes and draw wrong conclusions, but having more people who are well informed is generally better than hoping that some (often technical rather than business-aware) data czar, sitting over a data warehouse as gatekeeper, is going to save you from all this.
