Data Warehouses, Lakes, and Ecosystems...Oh My!
An oldie but goodie look at data ecosystems and how interaction between them is still key
I wrote this article on LinkedIn in 2018, and it's still very relevant and informative. To give it new life, I'm sharing it again in my newsletter! Enjoy! -- M@
This is a data lake...
Not really, but what is a data lake? How is it different from a data warehouse? And what does all of this have to do with culture change?
If we step back for a second, the era of data warehousing was actually brilliant. People started to think for the first time about how to draw together different transactional systems into a central place, harmonize that data to tell a story about business operations, and present it in high-performing reports. Business leaders started changing the culture of their businesses because data generated by different groups within a business process were now being strung together and handoffs between departments were highlighted, especially if they were causing operational inefficiencies. This spawned a number of simplification projects which ultimately connected the people in these processes in ways never before possible.
The advent of many more tools to ingest large data sets in real time (HVR) and store them on massive computing clusters (Spark, Hadoop, Greenplum); look at and catalog them (Alation); build models (Dataiku, Alteryx, DataRobot); and interact with them (Tableau, Spotfire, Qlik, et al.) has changed the entire concept of how we approach data. We’re now able to create data lakes, which are large repositories of raw data from source systems for all business processes. We’ve taken the field of Statistics to another level, calling it Data Science, and focusing it on predicting the future and teaching computers to learn.
As you can see from the list above (and I apologize for the somewhat complicated sentence), companies can't take a one-tool-to-rule-them-all approach these days. Instead, the ecosystem of tools is critical, and so is knowing which tool is better for which use case. Equally important is the tools' ability to connect and work with each other. If you ever evaluate any of the tools I mentioned above (or others in this space), all of them will tell you how well they connect to each other in order to move your work down the value stream towards the common goal of discovering insights. In much the same way, companies’ cultures are changing to focus on how people connect, generate data, and combine it to show a new view of what has happened, as well as what could happen (via predictive and prescriptive analytics).
My journey to grow the data and analytics culture includes making sure people understand that whether you call it a ‘warehouse’ or a ‘certified data set’, we still need stable, verified, and high-performing data models. These models need to power reports, dashboards, and mobile applications that help business leaders optimize their sales, development, and product fulfillment. Data lakes, however, are messy, raw, unstable environments. They need to amalgamate all kinds of data across an entire business to allow data scientists to experiment, train, and develop models that can predict the future and power artificial and augmented intelligence applications. Our culture needs to grow and adopt the new tech alongside the old instead of trying to replace one with the other: the ecosystem is the key.
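For readers who like to see the distinction in miniature, here is a rough Python sketch of the lake-versus-certified-data-set idea. Everything in it is made up for illustration: the `RAW_ORDERS` records, the field names, and the `certify_sales_by_region` helper are assumptions, not anything from a real system or from the tools named above. The raw list plays the role of the lake (kept exactly as it arrived, mess and all), and the function produces a small, verified summary of the kind a report or dashboard could rely on.

```python
from collections import defaultdict

# Hypothetical raw "lake" records, stored as they arrived from source systems.
RAW_ORDERS = [
    {"order_id": "A1", "region": "east", "amount": "120.50"},
    {"order_id": "A2", "region": "East ", "amount": "79.50"},   # inconsistent casing/spacing
    {"order_id": "A3", "region": None, "amount": "15.00"},      # missing region
    {"order_id": "A4", "region": "west", "amount": "n/a"},      # unparseable amount
    {"order_id": "A1", "region": "east", "amount": "120.50"},   # duplicate order
]


def certify_sales_by_region(raw_records):
    """Return (totals by region, rejected records) after basic verification.

    The checks are illustrative assumptions: a parseable amount,
    a non-empty region, and a unique order_id.
    """
    seen_ids = set()
    totals = defaultdict(float)
    rejected = []
    for rec in raw_records:
        order_id = rec.get("order_id")
        region = (rec.get("region") or "").strip().lower()
        try:
            amount = float(rec["amount"])
        except (KeyError, TypeError, ValueError):
            rejected.append(rec)
            continue
        if not region or order_id is None or order_id in seen_ids:
            rejected.append(rec)
            continue
        seen_ids.add(order_id)
        totals[region] += amount
    return dict(totals), rejected


certified, rejected = certify_sales_by_region(RAW_ORDERS)
print(certified)       # verified totals, safe to put in a report
print(len(rejected))   # records left in the lake for further investigation
```

Note that the rejected records are not thrown away. They stay in the raw set, where a data scientist might still find them useful, which is the point of keeping both the lake and the certified layer in the same ecosystem.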
I’d love to hear your thoughts and how you’re using a strategy of warehouses, certified data sets, and lakes. How do you think about the utility of a data lake versus the stability of a warehouse? Send me a direct message or comment below. In the meantime, like/share this article and follow me to keep the discussion going on the data culture journey!
Cheers, M@
Matt Brooks is a seasoned thought leader and practitioner in data and analytics; culture; product development; and transformation.