Sunday, November 24, 2013

Creating my first Hekaton table with SQL Server 2014

Introduction

SQL Server 2014 CTP2 has arrived and I have decided to install this new beta version of SQL Server 2014. One of the key new features is the new Hekaton engine, which is fully integrated in SQL Server 2014. This new feature takes advantage of in-memory capabilities. Microsoft proposes to use the in-memory tables for OLTP purposes. I haven't read all the information about Hekaton yet, but one question comes to mind: is it possible to use this feature for OLAP functionality as well? And how is it related to columnstore indexes? I hope to find out in future blogposts....

The first thing I did was install the CTP2 version on a VirtualBox machine. You can find numerous posts about creating a virtual machine with Oracle VM VirtualBox and installing CTP versions of SQL Server on my blog, so I won't elaborate on this subject.

Resources

For this blogpost I've used the following links:


Creating an in-memory table

1) First we create a new database with SQL Server 2014 Management Studio.



2) It seems to be necessary to create a filegroup and a container for the Hekaton memory-optimized tables. The filegroup is used to guarantee the durability of the data resident in memory in case of a crash or a restart. For the container of the filegroup we need to create a directory. Expand the Databases node, right-click your database and then click Properties. To add a new memory-optimized data filegroup, click the Filegroups page. Under MEMORY OPTIMIZED DATA, click Add filegroup and then enter the values for the filegroup.
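For those who prefer T-SQL over the GUI, a minimal sketch of the same step (the database name HekatonTest and the filegroup name HekatonFG are just example names, not the ones from the screenshots):

-- Add a filegroup that can contain memory-optimized data
-- (HekatonTest and HekatonFG are hypothetical names)
ALTER DATABASE HekatonTest
ADD FILEGROUP HekatonFG CONTAINS MEMORY_OPTIMIZED_DATA;
GO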


3) To add a file to the filegroup, click the General page. Under Database files, click Add and then enter the values for the file. Use file type FILESTREAM Data.
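The T-SQL equivalent of this step could look like this (the container name and the C:\data path are example values matching my setup):

-- Add a container to the memory-optimized filegroup;
-- note that FILENAME points to a directory, not a file
ALTER DATABASE HekatonTest
ADD FILE (NAME = 'HekatonContainer', FILENAME = 'C:\data\HekatonContainer')
TO FILEGROUP HekatonFG;
GO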


In the C:\data folder a new folder is created.



4) Create an in-memory table with the following script:

CREATE TABLE dbo.Test (
   TestId int not null primary key nonclustered hash with (bucket_count=2000),
   TestName int not null index ix_TestName nonclustered hash with (bucket_count=1000),
   CreatedDate datetime2 not null
)
WITH (MEMORY_OPTIMIZED=ON) -- durable table: schema and data survive a restart
GO

CREATE TABLE dbo.Test2 (
   TestId int not null primary key nonclustered hash with (bucket_count=2000),
   TestName int not null index ix_TestName nonclustered hash with (bucket_count=1000),
   CreatedDate datetime2 not null
)
WITH (MEMORY_OPTIMIZED=ON, DURABILITY=SCHEMA_ONLY) -- only the schema survives a restart
GO

And the tables are created in memory. In the folder I noticed that some files were created and/or modified:


When we take a look at the properties of the table, we see some options indicating that we have created a memory-optimized table:
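You can also verify this with a simple query, because sys.tables exposes the relevant columns in SQL Server 2014:

-- List the tables with their memory-optimized status and durability
SELECT name, is_memory_optimized, durability_desc
FROM sys.tables;
GO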


Conclusions

Woohoo, this is my first in-memory table created in SQL Server 2014 CTP2 and it seems very easy to do. I'll have to look into the details further, but creating and using memory-optimized tables in SQL Server 2014 seems an easy job.

Greetz,
Hennie


Wednesday, November 20, 2013

Installing SSDT for SQL Server 2014 CTP2

Today I installed SQL Server 2014 CTP2 and it seems to work pretty well. But hey, where is SSDT, the replacement of BIDS? I already found out (in one of my former blogposts) that there are two versions of SSDT belonging to SQL Server 2012: SSDT Database Developer and SSDT Business Intelligence. But now I can't find anything about BIDS or SSDT in the installation procedure, nor after the installation in the menus, panes or anywhere else.

There are different downloads of SSDT available on Microsoft.com. I tried multiple versions of SSDT and it was troublesome: I tried the different versions multiple times and errors occurred. Grrrrrrrrrr. Below are a couple of versions, with different download sizes.


And googling SSDT gives me confusing downloads. On the blog of Koen Verbeeck you can find more information about this strange situation; he is also confused about the SSDT situation, but Koen helped me further in understanding the problem. On the blog of Jamie Thomson more information can be found, and there is some information on the SQL Server Data Tools Blog as well.

As it seems to me you need to install the following software:

  1. SQL Server 2014 CTP2
  2. Visual Studio 2012
  3. Update software for SSDT and SQL Server 2014 CTP2

You can download the correct version of SSDT 2012 here. The total amount of software to download is about 800 MB(??). Don't mind the titles of the windows during the installation steps: "Installing SQL Server 2012".


And there's more software needed for the installation...


And...Succeeded....


The templates are installed...


According to the SQL Server Data Tools Blog, a new or upgraded version of SSDT for SQL Server 2014 is available. So I decided to install this one too, and it seems to be working.


Here you can see some information about SQL Server 2014 during the installation steps.


The installation went well and a confirmation can be found in the Help > About window of Visual Studio.



The version of SSDT for SQL Server 2014 is 11.1.31024.0.




I think the installation procedure isn't quite ready at this moment and Microsoft is still working on it. Hopefully the confusion about SSDT will come to an end.

Greetz,
Hennie



Tuesday, June 25, 2013

SQL Server 2014

I came across a very interesting diagram on the website of Jen Underwood. This diagram gives an impression of a Hadoop architecture combined with Microsoft products.


A comparison is also made between SAP and Microsoft, which is very interesting.



Please read the blog of James Serra and Jen Underwood for more information.

Greetz,
Hennie

Thursday, April 25, 2013

SSIS : Using Foreach Loop container building an initial load package

Introduction

Suppose you have a challenge at a customer. You have to load a staging area with data from multiple sources, like a couple of files and some tables from another database. All this data is snapshot data from different source systems. The source systems deliver a file every month with all of the customers that are currently active. Another challenge is that not all source systems deliver files at the same frequency: some of the information is delivered only once every couple of months. You could create an SSIS package for every month, but you can build a smarter solution.

The Problem

Suppose I have this situation at hand: there are multiple sources where the information is available, and these source systems deliver files on an (in)frequent basis. For this purpose I've created a metadata table that controls the loading of the information.


Suppose I have to process 4 months of data and I have 3 source systems that deliver files:

  • Source A.
  • Source B.
  • Source C.

Initial setup

For this lab I've created an example table. There are five columns in this table:

  • Year.
  • Month.
  • Source_A.
  • Source_B.
  • Source_C.
Below is the script that creates the table and loads it with data:

       
IF EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[InitialLoad]')
AND type in (N'U'))
DROP TABLE [dbo].[InitialLoad]
GO

CREATE TABLE [dbo].[InitialLoad](
 [Year] [varchar](50) NOT NULL,
 [Month] [varchar](50) NOT NULL,
 [Source_A] [varchar](50) NULL,
 [Source_B] [varchar](50) NULL,
 [Source_C] [varchar](50) NULL
) ON [PRIMARY]
GO

INSERT INTO [InitialLoad] ([Year], [Month], [Source_A], [Source_B], [Source_C] )
VALUES ('2013', '1', 'File_SourceA_20130101', 'File_SourceB_20130101' , NULL),
('2013', '2', 'File_SourceA_20130201', 'File_SourceB_20130201' , NULL),
('2013', '3', 'File_SourceA_20130301', 'File_SourceB_20130301' , 'File_SourceC_20130301'),
('2013', '4', 'File_SourceA_20130401', 'File_SourceB_20130401' , NULL)

SELECT * FROM [InitialLoad]
       

The SELECT statement results in the following table:
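Year  Month  Source_A               Source_B               Source_C
2013  1      File_SourceA_20130101  File_SourceB_20130101  NULL
2013  2      File_SourceA_20130201  File_SourceB_20130201  NULL
2013  3      File_SourceA_20130301  File_SourceB_20130301  File_SourceC_20130301
2013  4      File_SourceA_20130401  File_SourceB_20130401  NULL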


Building the package

First, let me show you the package we are going to build in this blogpost. The package has the following interesting parts:
  • Variables. There are four variables present: one Object variable (ObjectsList) and three String variables (varSourceA, varSourceB, varSourceC).
  • Execute SQL Task. In this step we query the InitialLoad table and fill ObjectsList with the result set of the query.
  • ForEach Loop container. In the ForEach Loop container we loop through the ObjectsList object and fill the appropriate variables with the values from ObjectsList.
  • Script tasks. These tasks are primarily created for debugging purposes.

Now let's walk through the steps of creating this SSIS package.

1. Create an SSIS package.
2. Create the four variables ObjectsList, varSourceA, varSourceB and varSourceC. The SSIS data types are important: choose System.Object for the ObjectsList variable and String for the others.
3. Create a connection to the database (in my case TEST).
4. Drag an Execute SQL Task to the package and rename it to "List of Source Objects".
5. Drag a ForEach Loop container to the package.
6. Drag the three Script Tasks into the ForEach Loop container and name them Source A, Source B and Source C.
7. Configure the Execute SQL Task by setting the right values on the General tab.
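The SQLStatement is a plain select over the metadata table, something like this (with the ResultSet property set to "Full result set"):

-- Query for the Execute SQL Task; the full result set
-- will be stored in the ObjectsList variable
SELECT [Year], [Month], [Source_A], [Source_B], [Source_C]
FROM [dbo].[InitialLoad]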



And on the Result Set tab:


8. Configure the ForEach Loop container: use the Foreach ADO Enumerator and select the ObjectsList variable as the ADO object source variable.



And enter the variable mappings.


9. Configure the Script tasks




10. Run the package

In Run 1:




Run 2:


Run 3:


and in Run 4:




And Successsss......


Conclusion

In this blogpost I've presented a simple method for building a metadata-driven loading SSIS package, with the aid of a metadata table and SSIS components like the ForEach Loop container.

Greetz,
Hennie

Sunday, April 7, 2013

Bi-Temporal Data warehouse

Introduction

During the last couple of years I've been developing data warehouses that historize data. Historizing data is a key concept of a data warehouse: it enables us to report the same figures through time, even when the data in the source systems changes. Creating trends through time is another important application of a data warehouse. I'm currently reading a book by Richard T. Snodgrass about developing time-oriented database applications in SQL. This book gives me more theoretical background on the daily job I'm exercising every day. In this blogpost I would like to elaborate a bit more on the (bi)temporal aspects of databases and queries.

Validity

Most databases that are designed for operational purposes store information that tracks the world around us in its current state. Questions like "What is the order status of that product now?" are queries that look at the current situation in a database. Although there are temporal concepts in the ANSI SQL standards, vendors have not implemented them completely. IBM has done some groundbreaking work with support for temporality.

Temporality (historizing) is an important feature in a data warehouse because we want to produce reports that give the same result at any time (even when the data in the operational system changes). In this blogpost I'll show you that in order to achieve this, you should build a bitemporal data warehouse.


As seen in the diagram above, imagine we have no history in our data warehouse solution and we record only the current situation. Suppose we have built a report that shows the status of orders in certain periods of time; running the report at different moments will then give different results. This is not a desirable situation. To report stable figures through time, you have to add certain temporal fields to your data warehouse, like ValidFrom and ValidUntil. ValidFrom denotes the starting instant (the starting day/time) of the period of validity of a row, and ValidUntil the terminating instant of that period.

In the example above we have 3 orders (1, 2, 3) from 3 different customers (A, B and C), and the records represent the current state of the database: no history, only the current situation. A lot of information is missing, particularly temporal information.

Suppose we have a source system and a data warehouse, and the information is loaded into the data warehouse every month on the 4th. At the end of the month a copy of the data is written to a text file and transferred to the data warehouse team. There is a change-date field in the order table, so we know when the data has changed. But because there is no history in the operational system, only the last change is present in the table and available for the data warehouse, unfortunately. Yes, we are missing information, because in-month changes are not detected.


In the scenario below, I'll show a hypothetical source system and a data warehouse. The source system delivers a text file to the data warehouse every month. So there are four files: January, February, March and April, and these are delivered at the beginning of the next month (on the 4th day). The files are read into the data warehouse immediately.

January
The first file that is inserted into the data warehouse is the file for the month of January. The January file and the subsequent loading procedure of the data warehouse give the following diagram:


The following events have happened:
  • On January 14 a new order (1) is entered for customer A.
  • On January 18 a new order (3) is entered for customer C.
  • Two records with a current status (31/12/9999 indicates an open-ended period) are stored in the data warehouse.

February
On March 4 a new file for the month of February is delivered to the data warehouse, as shown below:


The following events occurred in February:
  • On February 13 a new order (2) is entered for customer B. The end date is 31/12/9999.
  • On February 6 the order (3) for customer C changed from New to Accepted.

March
In April the file concerning the month of March is delivered. This file too is read on the 4th of April.


In March a couple of things have happened:
  • The status of order (2) from customer B changed from New to Accepted on March 3.
  • The status of order (3) from customer C changed from Accepted to OrderPick on March 7.
  • Order 2 with status New is end-dated.
  • A new record is created for order 2 with status Accepted.
  • Order 3 with status Accepted is end-dated.
  • A new record is created for order 3 with status OrderPick.

April
And in May another file (April) is delivered to the data warehouse, and you probably guessed it: changes have happened.


In April a couple of things have occurred:
  • The status of order (3) from customer C changed from OrderPick to Delivery on April 14.
  • Order 3 with status OrderPick is end-dated.
  • A new record is created for order 3 with status Delivery.

Now we can track orders through time: the order status table in the data warehouse captures the history of reality. Yet there is a catch: the validity of records doesn't say anything about when the record entered the data warehouse. There is a difference between the functional meaning of validity and the time the transaction is entered into a system (this is true for an operational system as well as for a data warehouse).

Suppose that we run a report twice, once on March 3 and once on March 5, and we want the data for the month of February. We'll see the following report data on March 3 and on March 5.


We have different results because the data of February is loaded into the data warehouse on March 4. The state of the data warehouse has changed and we can't reproduce the report of March 3 anymore. As I started this blog with the statement that reports should always be reproducible, it seems we have a problem, because we can't reproduce the report of March 3.

Transaction Time State

We need extra information in the data warehouse tables. ValidFrom and ValidUntil only capture when records are valid, not when the information became known (was inserted) in the data warehouse and when the information was updated (or deleted). This is where bitemporality comes into play. A data warehouse in which a previous state can be reconstructed is termed a "Bi-Temporal Data warehouse". In order to be bitemporal we need to add two date fields to the Order status table. I've called these transaction-time state fields TransFrom and TransUntil.
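As a sketch, the Order status table could then look like this (the column names and types are illustrative; they are not taken from a real system):

-- Bitemporal Order status table (illustrative sketch)
CREATE TABLE dbo.OrderStatus (
 OrderId     int         NOT NULL,
 CustomerId  char(1)     NOT NULL,
 OrderStatus varchar(20) NOT NULL,
 ValidFrom   date        NOT NULL, -- when the fact became true in reality
 ValidUntil  date        NOT NULL, -- when it stopped being true (31/12/9999 = still valid)
 TransFrom   date        NOT NULL, -- when the row entered the data warehouse
 TransUntil  date        NOT NULL  -- when the row was superseded (31/12/9999 = still current)
)
GO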



Because the information is inserted into the data warehouse on the 4th of each month, TransFrom and TransUntil are filled with dates like the 4th of a particular month. With these extra fields we can now reconstruct states in the past. Let's go back to the problem I explained at the end of the former paragraph: I said that it was impossible to reproduce the report of March 3 without the extra fields TransFrom and TransUntil. Can we reproduce the report of March 3? Yes, we can. If we select from the Order status table with the aid of the fields TransFrom and TransUntil, we can now reproduce the situation on March 3. This is shown below.
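In T-SQL such a reconstruction could look like the query below, a minimal sketch against the hypothetical OrderStatus table from the previous sketch. I take the last day of February as the validity instant and March 3 as the transaction-time instant:

-- Reproduce the February report as it was known on March 3;
-- the February load of March 4 is filtered out by TransFrom/TransUntil
SELECT OrderId, CustomerId, OrderStatus
FROM dbo.OrderStatus
WHERE ValidFrom <= '20130228' AND ValidUntil > '20130228' -- valid at the end of February
  AND TransFrom <= '20130303' AND TransUntil > '20130303' -- known on March 3
GO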


Conclusion

I started this blogpost with the statement that reports should be reproducible. In this blogpost I've shown a situation where this is not true when the data is delivered in batches (files) to the data warehouse. Build a Bi-Temporal Data warehouse when you have the following situation at hand:
  • There is a need for a fully auditable, reproducible data warehouse and/or BI solution. Think about accountancy, for example.
  • There is a significant difference between the time the data is valid and the time the data is entered into the data warehouse.
To solve this problem it is necessary to have 4 dates in your data warehouse in order to be able to reproduce every report at any time. These four dates are also known as the "Validity" dates and the "Transaction-time state" dates.

Greetz,
Hennie