zondag 7 april 2013

Bi-Temporal Data warehouse

Introduction

During the last couple of years I've been developing datawarehouses that historizes data. Historizing data is a key concept of a datawarehouse. This enables us to report the same figures through time, even when the data in the source systems change. Creating trends through time is another important feature of an application of a datawarehouse. Currently reading a book of Richard T. Snodgrass about developing time orient database applications in SQL. This book gives me a more theoretical background about the daily job that I am exercising everyday. In this blogpost I would like to elaborate a bit more about (bi)temporal aspects of databases and queries.

Validity

Most databases that are designed for operational purposes, store information that tracks the world around us in a current state. Questions like "What is orderstatus of that product now?" are queries that look at the current situation in a database. Although there are temporalization concepts in the ANSI SQL standards, vendors have not implemented them completely. IBM has done some groundbreaking job with support for temporalization.

Temporality (historizing) is an important feature in a data warehouse because we want to produce reports that give the same result anytime (even when the data in the operational system changes). In this blogpost I'll show you in order to achieve that, you should build a Bitemporal datawarehouse.


As seen above in the diagram, imagine we have no history in our datawarehouse solution and we record only the current situation. And, suppose we have built a report that shows the status of orders in certain periods of time, running the report on certain moments will give different results. This is not a desireable situation. To report stable figures through time, you have to add certain temporal fields to your datawarehouse, like ValidFrom and ValidUntil. ValidFrom denotes the starting instant (the starting day/time) of the period of the validity of a row and a terminating instant of the period of validity.

In the example above we have 3 orders (1, 2, 3) from 3 different customers (A, B and C) and the records represent the current state of the database. No history only the current situation. A lot of information is missing, particularly temporal information.

Suppose we have a sourcesystem and a data warehouse and the information is loaded into the data warehouse every month on the 4th. At the end of the month a copy of the data is created and inserted into a textfile and transferred to the datawarehouseteam. There is changedate field in the order table and therefore we know when the data has changed. But, because there is no history in the operational system only the last change is present in the table and available for the datawarehouse, unfortunately. Yes, we are missing information because in-month changes are not detected.


In the scenario below, I'll show a hypothetical sourcesystem and a datawarehouse. The source system delivers every month a textfile to the datawarehouse. So, there are four files: January, February, March and April and these are delivered at the beginning of the next month (4th day). The files are read instantly into the datawarehouse.

January
The first file that is inserted into the datawarehouse is the file of the month January. The January file and the subsequent loading procedure of the datawarehouse gives the following diagram:


The following events has happened:
  • On January 14 a new order (1) is entered for customer A.
  • On January 18 a new order (3) is entered for customer C.
  • Two records with a current status (31/12/9999 indicates an endless period) are stored in the datawarehouse.

February
On March 4 a new file of month February is delivered at the datawarehouse, as shown below:


The following events in february has occured :
  • On February 13 a new order (2) is entered for customer B. Enddate is 31/12/9999.
  • On February 6 the order (3) for customer C has changed from new to accepted.

March
In April the file is delivered concerning the month March. Also this file is read on the 4th of April


In March a couple of things has happened:
  • The status of the order (2) from customer B changed from New to accepted on March 3.
  • The status of the order (3) from customer C changed from Accepted to OrderPick on March 7.
  • Order 2 with status New is enddated.
  • A new record is created for order 2 with a status Accepted.
  • Order 3 with status Accepted is enddated .
  • A new record is created for order 3 with a status OrderPick.

April
And in May another file (April) is delivered to the datawarehouse and you probably guessed it, changes has happened.


In March a couple of things has occurred:
  • The status of the order (3) from customer C changed from OrderPick to Delivery on March 14. 
  • Order 3 with status OrderPick is enddated .
  • A new record is created for order 3 with a status Delivery.

Now, we can track orders through time. The order status table in the datawarehouse captures the history of reality. Yet, suppose the following situation: the validity of records doesn't say anything about when the record has entered the datawarehouse. There is a difference between the functional meaning of validity and the time that the transaction is entered into a system (this is true for a operational system or a datawarehouse).

Suppose that we run a report twice, once on March 3 and once on March 5 and we want the data for the month February.  W'll see the following reportdata on March 3 and on March 5.


We have different results because the data of February is loaded into the datawarehouse on March 4. The situation of the datawarehouse is changed and we can't reproduce the report of March 3 anymore. As,  I started my blog with the statement that reports should always be reproducable, it seems we have a problem because we can't reproduce the report on March 3.

Transaction Time State

We need extra information in the datawarehouse and particularly, tables. ValidFrom and ValidUntil only captures when records are valid but not when the information is known (inserted) in the datawarehouse and when the information was updated (or deleted). This is where Bitemporal comes into play. A datawarehouse that can be constructed from a previous state is termed "Bi-Temporal  Data warehouse".  In order to be Bitemporal we need to add two date fields to the Order Status table. I've called these transaction-time state fields TransFrom and TransUntil.



Because the information is inserted in the datawarehouse on the 4th of each month the TransFrom and TransUntil is filled with dates like the 4th of a particular month. With these extra fields we now can reconstruct states in the past. Let's go back to the problem I've explained at the end of the former paragraph. I've said that it was impossible to reproduce the report on March 3 without the addition of the extra fields TransFrom and TransUntil. Can we reproduce the report of March 3? Yes, we can. If we select the Order status table with aid of the fields TransFrom and TransUtil we now can reproduce the situation on March 3. This is shown below.


Conclusion

I started this blogpost with the statement that reports should be reproducable. In this blogpost I've shown a situation that this is not true when the data is delivered with batches (files) to the datawarehouse. Build a Bi-Temporal Data warehouse when you have the following situation at hand:
  • There is need for a fully auditable reproducable datawarehouse and or BI solution. Think about accountancy, for example.
  • There is significant difference in time that the data is valid and the data is entered into the datawarehouse.
In order to solve this problem it is necessary to have 4 dates in your data warehouse in order to be repoduce every report at any time. These four dates are also known as "Validity" dates and "Transaction-time state" dates.

Greetz,
Hennie  

1 opmerking: