BI Future Blog: April 2013

Introduction

Suppose you face the following challenge at a customer. You have to load a staging area with data from multiple sources, such as a couple of files and some tables from another database. All of this data is snapshot data from different source systems. Every month, a source system delivers a file with all of the customers that are currently active. Another challenge is that not all source systems deliver files at the same frequency; some of the information is delivered only once every couple of months. You could create an SSIS package for every month, but you can build a smarter solution.

The Problem

Suppose I have this situation at hand: the information is available in multiple sources. These source systems deliver files on a regular or irregular basis. For this purpose I've created a metadata table that controls the loading of the information.

Suppose I have to process 4 months of data and I have 3 source systems that deliver files:

Source A.
Source B.
Source C.

Initial setup

For this lab I've created an example table. The table has five columns:

Year.
Month.
Source_A.
Source_B.
Source_C.

Below is the script that creates the table and loads it with data:

       
IF  EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[InitialLoad]')
AND type in (N'U'))
DROP TABLE [dbo].[InitialLoad]
GO

CREATE TABLE [dbo].[InitialLoad](
 [Year] [varchar](50) NOT NULL,
 [Month] [varchar](50) NOT NULL,
 [Source_A] [varchar](50) NULL,
 [Source_B] [varchar](50) NULL,
 [Source_C] [varchar](50) NULL
) ON [PRIMARY]
GO

-- One row per month; the Month value matches the month in the file names.
INSERT INTO [InitialLoad] ([Year], [Month], [Source_A], [Source_B], [Source_C] )
VALUES ('2013', '1', 'File_SourceA_20130101', 'File_SourceB_20130101' , NULL),
('2013', '2', 'File_SourceA_20130201', 'File_SourceB_20130201' , NULL),
('2013', '3', 'File_SourceA_20130301', 'File_SourceB_20130301' , 'File_SourceC_20130301'),
('2013', '4', 'File_SourceA_20130401', 'File_SourceB_20130401' , NULL)

SELECT * FROM [InitialLoad]

The SELECT statement results in the following table:

Building the package

First, let me show you the package we are going to build in this blog post. The package has the following interesting parts:

Variables. There are four variables: one Object variable (ObjectsList) and three String variables (varSourceA, varSourceB, varSourceC).
Execute SQL Task. In this step we query the InitialLoad table and fill the ObjectsList variable with the result set of the query.
Foreach Loop container. In the Foreach Loop container we loop through the rows in the ObjectsList variable and fill each String variable with the matching value of the current row.
Script tasks. These tasks are primarily created for debugging purposes.
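
The statement used in the Execute SQL Task is not reproduced in the text. A minimal sketch of a query that fits the parts listed above could look like this. The column order (A, B, C, so that the columns can be mapped to varSourceA, varSourceB and varSourceC by index 0, 1 and 2) and the ORDER BY clause are assumptions for illustration, not taken from the original package:

```sql
-- Sketch of the query behind the "List of Source Objects" task (assumed, not the original).
-- Each row describes one month; each column holds the file name for one source.
SELECT [Source_A],
       [Source_B],
       [Source_C]
FROM [dbo].[InitialLoad]
-- Year and Month are varchar columns, so cast them to sort numerically.
ORDER BY CAST([Year] AS int), CAST([Month] AS int);
```

The casts matter because Year and Month are stored as varchar: without them, a month '10' would sort before a month '2'.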

Now let's walk through the steps of creating this SSIS package.

1. Create an SSIS package.
2. Create the four variables ObjectsList, varSourceA, varSourceB and varSourceC. The SSIS data types are important: choose Object for the ObjectsList variable and String for the others.
3. Create a connection to the database (in my case TEST).
4. Drag an Execute SQL Task to the package and rename it to "List of Source Objects".
5. Drag a Foreach Loop container to the package.
6. Drag three Script tasks into the Foreach Loop container and name them Source A, Source B and Source C.
7. Configure the Execute SQL Task by setting the right values on the General tab:

And on the Result Set tab:

8. Configure the Foreach Loop container:

And enter the variable mappings.

9. Configure the Script tasks:

10. Run the package

Run 1:

Run 2:

Run 3:

Run 4:

And success!
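
To check what the Foreach Loop container hands to the three variables on each iteration without opening SSIS, the loop can be imitated in plain T-SQL. This is only a sketch for inspecting the metadata table, not part of the package: the cursor stands in for the ObjectsList variable, the local variables mirror varSourceA, varSourceB and varSourceC, and the '(no file)' placeholder for sources that deliver nothing is an assumption. It expects the InitialLoad table from the script above to exist.

```sql
-- Local stand-ins for the three SSIS String variables.
DECLARE @varSourceA varchar(50),
        @varSourceB varchar(50),
        @varSourceC varchar(50);

-- The cursor plays the role of the ObjectsList result set.
DECLARE ObjectsList CURSOR LOCAL FAST_FORWARD FOR
    SELECT [Source_A], [Source_B], [Source_C]
    FROM [dbo].[InitialLoad]
    ORDER BY CAST([Year] AS int), CAST([Month] AS int);

OPEN ObjectsList;
FETCH NEXT FROM ObjectsList INTO @varSourceA, @varSourceB, @varSourceC;

-- One pass of this loop corresponds to one run of the Foreach Loop container.
WHILE @@FETCH_STATUS = 0
BEGIN
    PRINT 'Source A: ' + ISNULL(@varSourceA, '(no file)');
    PRINT 'Source B: ' + ISNULL(@varSourceB, '(no file)');
    PRINT 'Source C: ' + ISNULL(@varSourceC, '(no file)');
    PRINT '---';

    FETCH NEXT FROM ObjectsList INTO @varSourceA, @varSourceB, @varSourceC;
END

CLOSE ObjectsList;
DEALLOCATE ObjectsList;
```

With the four rows of the example table this prints four blocks, one per run, and Source C only shows a file name for the row where it is not NULL.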

Conclusion

In this blog post I've presented a simple method for building a metadata-driven SSIS load package with the aid of a metadata table and SSIS components such as the Foreach Loop container.

Greetz,

Hennie

Introduction

During the last couple of years I've been developing data warehouses that historize data. Historizing data is a key concept of a data warehouse. It enables us to report the same figures through time, even when the data in the source systems changes. Creating trends through time is another important application of a data warehouse. I'm currently reading a book by Richard T. Snodgrass about developing time-oriented database applications in SQL. This book gives me more theoretical background on the job I do every day. In this blog post I would like to elaborate a bit more on the (bi)temporal aspects of databases and queries.

Validity

Most databases that are designed for operational purposes store information that tracks the world around us in its current state. Questions like "What is the order status of that product now?" are queries that look at the current situation in a database. Although there are temporal concepts in the ANSI SQL standards, vendors have not implemented them completely. IBM has done some groundbreaking work with its support for temporal data.

Temporality (historizing) is an important feature of a data warehouse because we want to produce reports that give the same result at any time (even when the data in the operational system changes). In this blog post I'll show that, in order to achieve this, you should build a bitemporal data warehouse.

As seen above in the diagram, imagine we have no history in our data warehouse solution and we record only the current situation. Suppose we have built a report that shows the status of orders in certain periods of time. Running the report at different moments will then give different results. This is not a desirable situation. To report stable figures through time, you have to add certain temporal fields to your data warehouse, like ValidFrom and ValidUntil. ValidFrom denotes the starting instant (the starting day/time) of the period of validity of a row, and ValidUntil denotes the terminating instant of that period.
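
As a small illustration of what the two fields buy us, here is a sketch of a point-in-time query. The table variable @OrderStatus and its column names other than ValidFrom and ValidUntil are hypothetical, and I assume half-open periods (ValidFrom inclusive, ValidUntil exclusive) with 9999-12-31 marking the current row. The two rows use the dates of order 3 from the scenario further down in this post:

```sql
-- Hypothetical order status table with validity dates.
DECLARE @OrderStatus TABLE (
    OrderId    int         NOT NULL,
    Customer   char(1)     NOT NULL,
    Status     varchar(20) NOT NULL,
    ValidFrom  date        NOT NULL,  -- start of the validity period (inclusive)
    ValidUntil date        NOT NULL   -- end of the validity period (exclusive)
);

INSERT INTO @OrderStatus (OrderId, Customer, Status, ValidFrom, ValidUntil)
VALUES (3, 'C', 'New',      '2013-01-18', '2013-02-06'),
       (3, 'C', 'Accepted', '2013-02-06', '9999-12-31');

-- Which status was valid on the report date?
DECLARE @ReportDate date = '2013-01-31';

SELECT OrderId, Customer, Status
FROM @OrderStatus
WHERE ValidFrom <= @ReportDate
  AND @ReportDate < ValidUntil;
```

For 31 January this returns order 3 with status New, no matter how often the order changes afterwards. Setting @ReportDate to a date on or after 6 February returns the Accepted row instead.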

In the example above we have 3 orders (1, 2, 3) from 3 different customers (A, B and C), and the records represent the current state of the database. There is no history, only the current situation. A lot of information is missing, particularly temporal information.

Suppose we have a source system and a data warehouse, and the information is loaded into the data warehouse every month on the 4th. At the end of the month a copy of the data is created, written to a text file and transferred to the data warehouse team. There is a changedate field in the order table, and therefore we know when the data has changed. Unfortunately, because there is no history in the operational system, only the last change is present in the table and available to the data warehouse. So yes, we are missing information, because in-month changes are not detected.

In the scenario below, I'll show a hypothetical source system and a data warehouse. Every month the source system delivers a text file to the data warehouse. So there are four files: January, February, March and April, and each is delivered at the beginning of the next month (on the 4th day). The files are read into the data warehouse immediately.

January
The first file that is inserted into the data warehouse is the file for the month of January. The January file and the subsequent loading procedure of the data warehouse give the following diagram:

The following events have happened:

On January 14 a new order (1) is entered for customer A.
On January 18 a new order (3) is entered for customer C.
Two records with a current status (31/12/9999 indicates an endless period) are stored in the data warehouse.

February
On March 4 the file for the month of February is delivered to the data warehouse, as shown below:

The following events occurred in February:

On February 13 a new order (2) is entered for customer B. The end date is 31/12/9999.
On February 6 the order (3) for customer C changed from New to Accepted.

March
In April the file concerning the month of March is delivered. This file is also read on the 4th of April.

In March a couple of things have happened:

The status of the order (2) from customer B changed from New to Accepted on March 3.
The status of the order (3) from customer C changed from Accepted to OrderPick on March 7.
Order 2 with status New is end-dated.
A new record is created for order 2 with status Accepted.
Order 3 with status Accepted is end-dated.
A new record is created for order 3 with status OrderPick.
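
The end-dating described above comes down to two statements per changed order. Below is a minimal sketch for order 2, under the assumption of a hypothetical @OrderStatus table variable and 9999-12-31 as the open end date. A real load would do this set-based for all changed orders in the file and inside a transaction:

```sql
-- Hypothetical order status table with validity dates only.
DECLARE @OrderStatus TABLE (
    OrderId    int         NOT NULL,
    Customer   char(1)     NOT NULL,
    Status     varchar(20) NOT NULL,
    ValidFrom  date        NOT NULL,
    ValidUntil date        NOT NULL
);

-- State after the February load: order 2 is open-ended with status New.
INSERT INTO @OrderStatus (OrderId, Customer, Status, ValidFrom, ValidUntil)
VALUES (2, 'B', 'New', '2013-02-13', '9999-12-31');

-- The change found in the March file.
DECLARE @OrderId    int         = 2,
        @NewStatus  varchar(20) = 'Accepted',
        @ChangeDate date        = '2013-03-03';

-- Step 1: end-date the row that is current so far.
UPDATE @OrderStatus
SET ValidUntil = @ChangeDate
WHERE OrderId = @OrderId
  AND ValidUntil = '9999-12-31';

-- Step 2: insert the new current row, starting where the old one ended.
INSERT INTO @OrderStatus (OrderId, Customer, Status, ValidFrom, ValidUntil)
VALUES (@OrderId, 'B', @NewStatus, @ChangeDate, '9999-12-31');

-- Show the resulting history of the order.
SELECT OrderId, Customer, Status, ValidFrom, ValidUntil
FROM @OrderStatus
ORDER BY OrderId, ValidFrom;
```

The final SELECT shows two rows for order 2: New, valid from 13 February until 3 March, and Accepted, valid from 3 March with the open end date.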

April
And in May another file (April) is delivered to the data warehouse and, as you probably guessed, changes have happened.

In April a couple of things have occurred:

The status of the order (3) from customer C changed from OrderPick to Delivery on April 14.
Order 3 with status OrderPick is end-dated.
A new record is created for order 3 with status Delivery.

Now we can track orders through time. The order status table in the data warehouse captures the history of reality. Yet there is a catch: the validity of a record doesn't say anything about when the record entered the data warehouse. There is a difference between the functional meaning of validity and the time at which the transaction is entered into a system (this is true for an operational system as well as a data warehouse).

Suppose that we run a report twice, once on March 3 and once on March 5, and we want the data for the month of February. We'll see the following report data on March 3 and on March 5.

We have different results because the data of February is loaded into the data warehouse on March 4. The state of the data warehouse has changed and we can't reproduce the report of March 3 anymore. Since I started this blog post with the statement that reports should always be reproducible, it seems we have a problem.

Transaction Time State

We need extra information in the data warehouse, and in its tables in particular. ValidFrom and ValidUntil only capture when records are valid, not when the information became known to (was inserted into) the data warehouse and when it was updated (or deleted). This is where bitemporal comes into play. A data warehouse whose previous states can be reconstructed is termed a "Bi-Temporal Data warehouse". In order to be bitemporal we need to add two date fields to the Order Status table. I've called these transaction-time state fields TransFrom and TransUntil.

Because the information is inserted into the data warehouse on the 4th of each month, TransFrom and TransUntil are filled with dates like the 4th of a particular month. With these extra fields we can now reconstruct states in the past. Let's go back to the problem I explained at the end of the previous section. I said that it was impossible to reproduce the report of March 3 without the extra fields TransFrom and TransUntil. Can we reproduce the report of March 3 now? Yes, we can. If we query the Order Status table with the aid of the fields TransFrom and TransUntil, we can reproduce the situation on March 3. This is shown below.
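
A sketch of such a query is given here. It rests on several assumptions that are not spelled out in the text: the table variable @OrderStatus is hypothetical, all periods are half-open, a row that is superseded by a later load gets that load's date as TransUntil, and "the data for the month of February" is read as the status valid on 28 February. The rows reflect the January load (on 4 February) and the February load (on 4 March):

```sql
-- Hypothetical bitemporal order status table.
DECLARE @OrderStatus TABLE (
    OrderId    int         NOT NULL,
    Customer   char(1)     NOT NULL,
    Status     varchar(20) NOT NULL,
    ValidFrom  date        NOT NULL,  -- when the status became true in reality
    ValidUntil date        NOT NULL,
    TransFrom  date        NOT NULL,  -- when the warehouse learned about the row
    TransUntil date        NOT NULL   -- when the warehouse replaced the row
);

INSERT INTO @OrderStatus
    (OrderId, Customer, Status, ValidFrom, ValidUntil, TransFrom, TransUntil)
VALUES
    -- Loaded on 4 February from the January file.
    (1, 'A', 'New',      '2013-01-14', '9999-12-31', '2013-02-04', '9999-12-31'),
    (3, 'C', 'New',      '2013-01-18', '9999-12-31', '2013-02-04', '2013-03-04'),
    -- Loaded on 4 March from the February file.
    (2, 'B', 'New',      '2013-02-13', '9999-12-31', '2013-03-04', '9999-12-31'),
    (3, 'C', 'New',      '2013-01-18', '2013-02-06', '2013-03-04', '9999-12-31'),
    (3, 'C', 'Accepted', '2013-02-06', '9999-12-31', '2013-03-04', '9999-12-31');

DECLARE @AsOf       date = '2013-03-03';  -- the day the report was originally run
DECLARE @ReportDate date = '2013-02-28';  -- the moment the report is about

SELECT OrderId, Customer, Status
FROM @OrderStatus
WHERE TransFrom <= @AsOf       AND @AsOf       < TransUntil   -- what the warehouse knew
  AND ValidFrom <= @ReportDate AND @ReportDate < ValidUntil   -- what was valid
ORDER BY OrderId;
```

With @AsOf set to 3 March the query returns orders 1 and 3, both with status New, which is what the warehouse knew before the February file arrived. With @AsOf set to '2013-03-05' it returns order 1 (New), order 2 (New) and order 3 (Accepted).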

Conclusion

I started this blog post with the statement that reports should be reproducible. I've shown a situation in which this does not hold when the data is delivered to the data warehouse in batches (files). Build a Bi-Temporal Data warehouse when you have the following situation at hand:

There is a need for a fully auditable, reproducible data warehouse and/or BI solution. Think about accountancy, for example.
There is a significant difference between the time that the data is valid and the time that the data is entered into the data warehouse.

To solve this problem you need 4 dates in your data warehouse, so that you can reproduce every report at any time. These four dates are also known as "Validity" dates and "Transaction-time state" dates.

Greetz,

Hennie

BI Future Blog

Thursday, 25 April 2013

SSIS : Using Foreach Loop container building an initial load package

Sunday, 7 April 2013

Bi-Temporal Data warehouse
