Moving from Data Warehouse to Data Lake
Before you begin reading this article, it is important to clarify which topics will not be covered:
- This article will not answer the question of which method is better because the answer is not binary.
- This article will not recommend which technology you should use, but it will discuss a number of tools and solutions that could be helpful for you.
What this article will cover:
- This article will focus on the most important topics that require your attention before migrating your Data Warehouse to a Data Lake.
- This article will give you a good idea of which direction to take when migrating from a Data Warehouse to a Data Lake.
Since the topic of non-relational data and Big Data is a “grey” area full of buzzwords, with numerous vendors scrambling for market share, there is a method you should follow regardless of the tools you pick. In other words, while each organization has its own unique needs and requires an individualized approach, you should still use a proven methodology that enables segregation between tools and logic layers.
Let’s start with a few basic explanations:
Data Warehouses
Data Warehousing is a method used to store structured data in a relational schema. The data sources are varied and usually include Online Transaction Processing (OLTP), Enterprise Resource Planning (ERP), and Customer Relationship Management (CRM) systems, and the data is brought in through a process called Extract, Transform, and Load (ETL).
The main goal of Data Warehousing is to transform data into a structured format that supports business logic and enables BI tools to create analysis and reports representing “one version of the truth”.
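To make the ETL flow concrete, here is a minimal sketch in Python, assuming a toy in-memory SQLite database standing in for the warehouse; the table and field names are hypothetical, not from any specific system:

```python
import sqlite3

# Extract: rows as they might arrive from an OLTP / CRM source.
source_rows = [
    {"order_id": "1001", "amount": "19.99", "country": "us"},
    {"order_id": "1002", "amount": "5.00", "country": "DE"},
]

# Transform: cast types and normalize values into the warehouse schema.
clean_rows = [
    (int(r["order_id"]), float(r["amount"]), r["country"].upper())
    for r in source_rows
]

# Load: insert into a structured, relational warehouse table.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE fact_orders (order_id INTEGER, amount REAL, country TEXT)"
)
warehouse.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)", clean_rows)
warehouse.commit()
```

A real pipeline would use a dedicated ETL tool, but the three phases are exactly these.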
Two of the most common schema types used in Data Warehouses are the Snowflake Schema and the Star Schema. These schemas address specific concerns (a minimal schema sketch follows this list), such as:
- Querying historical data and tracking data changes over time
- Data quality and integrity
- High query performance and ease of reporting
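As a minimal illustration of what a Star Schema looks like, here is a sketch using SQLite; the dimension and fact tables are invented for the example:

```python
import sqlite3

# A minimal star-schema sketch: dimension tables around a central
# fact table that references them by surrogate keys.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (
    customer_key INTEGER PRIMARY KEY,
    customer_name TEXT,
    segment TEXT
);
CREATE TABLE dim_date (
    date_key INTEGER PRIMARY KEY,  -- e.g. 20240131
    full_date TEXT,
    year INTEGER,
    month INTEGER
);
-- The fact table sits at the center of the "star".
CREATE TABLE fact_sales (
    customer_key INTEGER REFERENCES dim_customer(customer_key),
    date_key INTEGER REFERENCES dim_date(date_key),
    quantity INTEGER,
    amount REAL
);
""")
```

A Snowflake Schema takes the same idea further by normalizing the dimension tables themselves.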
Data Lakes
The Data Lake has become more popular as organizations have needed to store, handle, and analyze petabytes and even exabytes of collected data, including real-time streaming data, online events, and logs. Data Lakes can hold both structured and unstructured data, providing the flexibility to analyze data from different angles and to answer multiple business questions from a variety of stored data sources. Moreover, Data Lakes are capable of handling massive volumes of stored data.
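To illustrate what landing raw data in a lake can look like, here is a minimal sketch that appends JSON events to date-partitioned paths on local disk; in a real deployment the target would be object storage such as S3, GCS, or Azure Blob, and the field names here are made up:

```python
import json
from datetime import datetime, timezone
from pathlib import Path

def land_event(event: dict, root: str = "datalake/raw/events") -> Path:
    """Append a raw event, as-is, to a date-partitioned JSON-lines file."""
    now = datetime.now(timezone.utc)
    # Hive-style partitioning (year=/month=/day=) keeps later scans cheap.
    part = Path(root) / f"year={now:%Y}" / f"month={now:%m}" / f"day={now:%d}"
    part.mkdir(parents=True, exist_ok=True)
    target = part / "events.jsonl"
    with target.open("a") as f:
        f.write(json.dumps(event) + "\n")
    return target

land_event({"user_id": 42, "action": "page_view", "url": "/pricing"})
```

Note that nothing is transformed on the way in: the lake stores the event exactly as it arrived, and structure is applied later, at query time.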
Summarized differences
Data Lake Weaknesses
- The solution can be expensive in terms of compute
- Technical complexity
- Transactional operations such as “update” are not natural for some Data Lakes
Data Warehouse Weaknesses
- Preparation and maintenance costs and time
- Limited support for unstructured data types
- Difficulty handling Big Data volumes
2 Concepts, 1 Solution
If you are dealing with accounting data, sales, inventory, and customers, I would strongly recommend that you continue working with your Data Warehouse in order to achieve the most accurate numbers.
On the other hand, if you are dealing with data coming from social media, weblogs, streaming data, or call center data, I would strongly recommend that you use Data Lakes.
And although we are dealing with two different concepts, this doesn’t mean that they don’t “walk hand in hand”. You may find that, to answer a specific business need, you have to combine event data with transactional data, and this combination can become your new Data Lake architecture.
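As a toy illustration of combining the two, the sketch below joins warehouse-style order records with lake-style clickstream events using pandas; all names and numbers are invented for the example:

```python
import pandas as pd

# Warehouse-style transactional data: one row per order.
orders = pd.DataFrame({
    "customer_id": [1, 2],
    "order_total": [120.0, 80.0],
})

# Lake-style event data: one row per page view.
events = pd.DataFrame({
    "customer_id": [1, 1, 2, 3],
    "page": ["/home", "/pricing", "/pricing", "/home"],
})

# Count page views per customer, then join them onto the order totals
# to answer a question neither source can answer alone.
views = events.groupby("customer_id").size().reset_index(name="page_views")
combined = orders.merge(views, on="customer_id", how="left")
print(combined)
```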
Moving From a Data Warehouse To a Data Lake
Here are two examples of cloud architecture, provided by my dear friend Omid Vahadaty, a Big Data ninja:
[Two cloud architecture diagrams]
Architecture
Let’s dive straight into the architecture for Data Lake implementation. As illustrated above, the first 3-stage Data Lake model parallels the Data Warehouse technique (i.e., mirroring, staging, and modeling layers), and is recommended regardless of whether you use an AWS, Google Cloud, or Azure environment.
This architecture is convenient for keeping everything in order. If a problem arises in one of the layers, it is easier to understand where and how to fix it. It also creates tool segregation for each need, so you can replace one tool without affecting another.
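One way to picture this segregation is as an explicit mapping from each layer to its role and its swappable tooling; the sketch below is only a mental model, and the tool entries are placeholders, since this article deliberately avoids recommending specific technologies:

```python
# A sketch of the 3-layer segregation described above.
LAYERS = {
    "mirroring (MRR)": {
        "role": "raw, as-is copies of each source system",
        "tool": "<your ingestion tool>",
    },
    "staging (STG)": {
        "role": "cleaned and transformed data",
        "tool": "<your transformation tool>",
    },
    "modeling (DWH)": {
        "role": "business-facing models for BI and reporting",
        "tool": "<your BI / reporting tool>",
    },
}

# Tool segregation means replacing one layer's tool only touches
# that layer's entry, not the others.
```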
Steps for Data Lake creation
Implementing a cloud Data Lake architecture is quite straightforward:
- First, choose a Data Lake solution based on your needs and technological environment. Contact us if you need help picking one.
- Second, create 3 datasets: Ingestion (for MRR processes), Transformation (for STG processes), and Modeling (for DWH).
- Third, dump your raw data into the Ingestion (MRR) dataset.
- Fourth, migrate the queries from your ETL process and save them as views in the relevant datasets (see the sketch after this list for steps two and four).
- Fifth, only after all the layers are implemented, select an ELT tool to bring the data to the cloud and an orchestration tool to run the queries you just saved.
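As a minimal sketch of steps two and four, here is what creating the three datasets and saving a query as a view could look like, assuming Google BigQuery as the Data Lake engine (an assumption; AWS and Azure have equivalent constructs), with hypothetical project, dataset, and table names:

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical project

# Step 2: create the three datasets, one per layer.
for name in ("mrr_ingestion", "stg_transformation", "dwh_modeling"):
    client.create_dataset(name, exists_ok=True)

# Step 4: migrate an ETL query by saving it as a view in the
# staging dataset (the SQL below is illustrative only).
view = bigquery.Table("my-project.stg_transformation.clean_orders")
view.view_query = """
    SELECT order_id, CAST(amount AS NUMERIC) AS amount
    FROM `my-project.mrr_ingestion.raw_orders`
    WHERE amount IS NOT NULL
"""
client.create_table(view, exists_ok=True)
```

Because the transformations live in views rather than in the ingestion tool, the ELT and orchestration tools you pick in the final step remain interchangeable.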
I hope this article helped you understand the process and its logic. Let us know what you think here.