donderdag 31 december 2009

How to ETL a dimension in a datawarehouse (part I)

Today I want to discuss some different methods of extract, transform and loading (ETL) a dimension into a datawarehouse. With ETLing a dimension into a datawarehouse i mean loading (ETL) the information from a source system (OLTP) into dimensiontable which is part of a star model. There are several approaches of doing this. I will discuss the different ways which you can use to load a dimension.

I think there are two methods of ETLing a datawarehouse. Some guys from SQLBI wrote some articles on how they perform their loading method (SQLBI articles). They pointed me on a (obvious) loading method for small sized systems. So as I said earlier there are two general loading methods for a dimension:

• ‘ Loading the datawarehouse on the fly’. You can discuss whether this fits into the definition of datawarehouse (non volatile? blabla). But this approach can be used when the data amount is small or at least loadable during the night. I heard that some Inmon minded datawarehouses generate star models on the fly from a (relational) datawarehouse. But, I haven’t seen this working (meaning it could be there but haven’t seen it yet).

• Keeping record of earlier loaded information in your datawarehouse. Meaning that a process of loading the information into a dimensiontable, keeps track of whether the information hasn’t loaded earlier. If the record is loaded earlier, don’t process further. If not loaded before load (new, changed) the record into your dimensiontable. And, if deleted mark the specific record deleted.

What do want to store in your dimension?

• a new record in the source system (table, view, etc).

• a changed record in the source system (table, view, etc).

• a deleted record in the source system (table, view, etc).

The first 2 options of changes (new and changed) in the source are quite non discussable, but detecting a deleted record in the source is more difficult, because it requires an another approach than the first two options. The first two options is simply comparing the incoming records in the datawarehouse against the already loaded information, and then you have to decide what you want to do with this (Update (Type I) or Insert (TYPEII)). But, detection of deleted record is something different.

How do detect a deleted record in the source system. There are two options I can think of:

• Using trigger (CDC or alike) which fires an event when the record is deleted in the source.

• Or detect the deleted records by offering the complete dataset to the datawarehouse for every ETL.

The first method could be named ‘Pushing the deleted record in the datawarehouse’ and the other one is ‘pulling the deleted records into a datawarehouse’. Both has their advantages and disadvantages. I won’t discuss this here.

So this is it for now. A blog shouldn’t be too lengthly… Next part laters.

Geen opmerkingen:

Een reactie posten