Data warehouse (part 3): ETL or ELT? A touch of practice as well
As briefly mentioned in part 1 and part 2, we are all probably familiar with the term “ETL”. In this part, let’s dive deeper into the concepts behind ETL and ELT, along with a bit of ETL practice in Visual Studio 2019.
As shown in part 1, the nature of traditional RDBMS/cubes leads to a middle layer between the data source systems/applications and the data warehouse, named ETL. Raw data from the sources is extracted, transformed, and then loaded into the data warehouse, so that no transformation happens inside the data warehouse itself. With big data, however, doing ETL into a data lake is suboptimal because of the data lake’s nature. Instead of ETL, we therefore have ELT (Extract, Load and Transform): data is first extracted from the source, loaded into the data lake (or a data warehouse-lake hybrid), and only then transformed to be ready for analysis.
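Before moving to the SSIS practice, the ETL-vs-ELT contrast can be sketched in a few lines of Python, with an in-memory SQLite database standing in for the warehouse. All table and column names here are made up for illustration; the point is only *where* the transformation step runs: in the pipeline (ETL) or inside the warehouse after loading raw data (ELT).

```python
import sqlite3

# Hypothetical raw source rows: (customer, amount in cents)
source_rows = [("alice", 1250), ("bob", 990), ("alice", 310)]

warehouse = sqlite3.connect(":memory:")

# --- ETL: transform in the pipeline, load only clean data ---
warehouse.execute("CREATE TABLE sales_etl (customer TEXT, amount_usd REAL)")
transformed = [(c, cents / 100.0) for c, cents in source_rows]  # transform first
warehouse.executemany("INSERT INTO sales_etl VALUES (?, ?)", transformed)

# --- ELT: load the raw data as-is, transform inside the warehouse ---
warehouse.execute("CREATE TABLE sales_raw (customer TEXT, amount_cents INTEGER)")
warehouse.executemany("INSERT INTO sales_raw VALUES (?, ?)", source_rows)
warehouse.execute(
    "CREATE TABLE sales_elt AS "
    "SELECT customer, amount_cents / 100.0 AS amount_usd FROM sales_raw"
)

etl_total = warehouse.execute("SELECT SUM(amount_usd) FROM sales_etl").fetchone()[0]
elt_total = warehouse.execute("SELECT SUM(amount_usd) FROM sales_elt").fetchone()[0]
print(etl_total, elt_total)  # both paths end with the same analytical result
```

Note that in the ELT path the raw table survives, which is exactly what makes the approach attractive for data lakes: new transformations can be derived from the raw data later without re-extracting from the source.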
Challenges with traditional ETL:
- Significant business analysis before storing data
- Significant data modelling before storing data
Two variations of ETL
Initial load
- Normally a one-time-only run
- Happens right before the data warehouse goes live
- Loads all relevant data necessary for BI and analytics (not all possible source data)
- Must be redone if the data warehouse “blows up”
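The bullets above can be sketched as a minimal, re-runnable initial load in Python, again with in-memory SQLite databases standing in for the source system and the warehouse. The table names (`orders`, `fact_orders`) are purely illustrative:

```python
import sqlite3

# Hypothetical source system with some existing data
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER, total REAL)")
source.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10.0), (2, 20.0)])

# Empty warehouse, right before go-live
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE fact_orders (id INTEGER, total REAL)")

def initial_load():
    # Wipe the target first so the whole load can be redone from
    # scratch if the warehouse ever "blows up".
    warehouse.execute("DELETE FROM fact_orders")
    # Pull only the relevant data for BI/analytics, not every source table
    rows = source.execute("SELECT id, total FROM orders").fetchall()
    warehouse.executemany("INSERT INTO fact_orders VALUES (?, ?)", rows)

initial_load()
count = warehouse.execute("SELECT COUNT(*) FROM fact_orders").fetchone()[0]
print(count)  # 2 rows loaded
```

Because the load deletes before inserting, running it twice leaves the warehouse in the same state, which is the property you want when a rebuild is needed.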