DW Architecture: The ETL Process

November 7, 2012

In the DW Architecture block diagram, I presented the Extract, Transform, and Load (ETL) process as arrows connecting the different repositories.

ETL is the process used to extract data from the data sources and from the unstructured and big data repositories, and to load the transformed data into the enterprise data warehouse (EDW). ETL is also used when populating the data marts.
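As a minimal sketch of the idea, the following Python fragment outlines the three phases against a hypothetical source and an in-memory SQLite target; the function names, table, and data are illustrative assumptions, not part of the original diagram.

```python
import sqlite3

def extract(source_rows):
    """Pull raw records from a (hypothetical) source system."""
    return list(source_rows)

def transform(rows):
    """Standardize records before loading (trim and upper-case names)."""
    return [(r["id"], r["name"].strip().upper()) for r in rows]

def load(conn, rows):
    """Append the transformed records to the warehouse table."""
    conn.executemany("INSERT INTO customer_dim VALUES (?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer_dim (id INTEGER, name TEXT)")
source = [{"id": 1, "name": "  acme corp "}, {"id": 2, "name": "globex"}]
load(conn, transform(extract(source)))
print(conn.execute("SELECT * FROM customer_dim").fetchall())
```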

To better understand the ETL process, let's briefly trace the data flow from its inception to its utilization at the presentation layer.

The data produced by business operations is captured in OLTP systems (structured data) and in spreadsheets, text documents, audio recordings, videos, PDF files, XML files, etc. (unstructured/semi-structured formats). Streaming event data is also collected; this data can originate from sensors, weblog files, etc. Streaming event data tends to grow rapidly and reach high volumes; these characteristics make it expensive to store in an RDBMS.

It is important to verify the quality of the data near its acquisition point. It is less expensive to correct data errors at this point than to cleanse the data later.
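For instance, a lightweight validation step like the sketch below, run at the point of capture, can reject or flag bad records before they propagate downstream; the rules and field names are hypothetical.

```python
def validate(record):
    """Return a list of quality problems found in one incoming record."""
    errors = []
    if not record.get("customer_id"):
        errors.append("missing customer_id")
    if record.get("amount", 0) < 0:
        errors.append("negative amount")
    return errors

incoming = [{"customer_id": "C1", "amount": 100.0},
            {"customer_id": "", "amount": -5.0}]
good = [r for r in incoming if not validate(r)]
bad = [(r, validate(r)) for r in incoming if validate(r)]
print(len(good), "accepted;", len(bad), "rejected:", bad)
```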

During data acquisition, master data is used to reduce redundancy, integrate concepts, and promote data stewardship.
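A sketch of how that can look, assuming a master data table that maps the local codes each source system uses to a single conformed identifier (all names here are invented for illustration):

```python
# Hypothetical master data: each source's local code maps to one
# conformed (master) customer identifier.
MASTER_CUSTOMER = {
    ("crm", "CUST-0042"): "M-1001",
    ("billing", "42"): "M-1001",
}

def conform(source_system, local_code):
    """Resolve a source-specific code to the master identifier."""
    key = (source_system, local_code)
    if key not in MASTER_CUSTOMER:
        raise KeyError(f"unmapped code {key}; route to data stewardship")
    return MASTER_CUSTOMER[key]

# Two records from different systems resolve to the same master entity.
print(conform("crm", "CUST-0042"), conform("billing", "42"))
```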

Depending on the physical implementation, data subsets can be moved to and integrated in an enterprise data repository (EDR) and/or an operational reporting data repository (ORDR).

It is less expensive to store unstructured, semi-structured, and streaming event data in unstructured and big data repositories (note that most of the big RDBMS vendors allow this type of data to be stored in their databases).

To be useful for decision support, data needs to be processed and integrated. Data is moved from the source systems and from the unstructured and big data repositories to the integration/staging area. Commonly, new data (the delta) originating from operational databases is replicated, without any modification, to databases in the staging area.
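One common way to capture only the delta is a high-watermark query on a last-modified timestamp, as in this sketch; the table, column, and the way the watermark is stored are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, updated_at TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, "2012-11-01"), (2, "2012-11-06"), (3, "2012-11-07")])

last_watermark = "2012-11-05"  # value saved after the previous run

# Extract only rows changed since the last load, without modification.
delta = conn.execute(
    "SELECT id, updated_at FROM orders WHERE updated_at > ?",
    (last_watermark,)).fetchall()

new_watermark = max(row[1] for row in delta)  # persist for the next run
print(delta, "->", new_watermark)
```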

Unstructured, semi-structured, and streaming event data are consolidated in the unstructured and big data repositories. Different types of implementation exist: it can be a single physical repository that holds both data types, or two different physical repositories. Reduction algorithms are applied to this data to obtain insight and to identify patterns and trends. The resulting reduced data is sent to the staging area to be integrated with data from other origins.
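As an illustration of such a reduction, the sketch below collapses raw weblog events into per-page hit counts, so only the much smaller aggregate travels to staging; the event shape is hypothetical.

```python
from collections import Counter

# Hypothetical raw weblog events; the real stream would be far larger.
events = [
    {"page": "/home", "user": "u1"},
    {"page": "/home", "user": "u2"},
    {"page": "/pricing", "user": "u1"},
]

# Reduce: keep only the aggregate signal (hits per page), not every event.
hits_per_page = Counter(e["page"] for e in events)
print(dict(hits_per_page))  # {'/home': 2, '/pricing': 1}
```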

The staging area is a busy place. Some of the transformations that take place here (a few are sketched in the example below) are:

- data from different sources is integrated;
- codes and data types are standardized;
- some data is transformed, and new data is derived from existing data (calculated, manufactured);
- fact and dimension tables are born: transactional fact table entries are assembled and snapshot fact entries are prepared;
- data is prepared for persistent storage and for later use by the different front-end tools in the presentation layer.

When the data is ready, it is added to the EDW.
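Here is a sketch of a few of these transformations on one incoming record: code standardization, a derived (calculated) measure, and a dimension lookup to assemble a fact table entry. All table, code, and field names are assumptions made for the example.

```python
GENDER_CODES = {"M": "Male", "F": "Female", "1": "Male", "2": "Female"}
PRODUCT_DIM = {"SKU-9": 501}  # product natural key -> surrogate key

def stage(raw):
    """Transform one raw sales record into a fact table entry."""
    qty = int(raw["qty"])
    unit_price = float(raw["unit_price"])
    return {
        "product_key": PRODUCT_DIM[raw["sku"]],          # dimension lookup
        "customer_gender": GENDER_CODES[raw["gender"]],  # standardized code
        "quantity": qty,
        "extended_amount": round(qty * unit_price, 2),   # derived measure
    }

raw = {"sku": "SKU-9", "gender": "1", "qty": "3", "unit_price": "19.99"}
print(stage(raw))
```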

Normally, the EDW is enormous. Even though it can be used directly to analyze data, obtain insights, or do data mining, its high volume and complexity make it unsuitable, or not fast enough, to serve as a playground for end users. To facilitate data utilization, we derive data from the EDW into data marts, where data is segregated into business domains, organized in dimensional models, and pre-aggregated to allow fast access.
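A simple sketch of that pre-aggregation: fact rows are summarized by the keys end users query, so their tools hit a small, ready-made table instead of the full EDW. The schema and names are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_fact (region TEXT, month TEXT, amount REAL)")
conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)",
                 [("east", "2012-10", 100.0),
                  ("east", "2012-10", 50.0),
                  ("west", "2012-10", 75.0)])

# Pre-aggregate into a data mart summary table for fast end-user access.
conn.execute("""CREATE TABLE sales_mart AS
                SELECT region, month, SUM(amount) AS total_amount
                FROM sales_fact GROUP BY region, month""")
print(conn.execute("SELECT * FROM sales_mart ORDER BY region").fetchall())
```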

All the manipulation and operations on the data required to move it from its inception point to the EDW and the data marts are part of the ETL process. This is a high-level view of the ETL process; it is not intended to be complete, but to give a general idea of it.

 

Read Further

The Data Warehouse Architecture

