Thursday, 31 December 2009

How to ETL a dimension in a datawarehouse (part I)

Today I want to discuss some different methods of extracting, transforming and loading (ETL) a dimension into a datawarehouse. By ETLing a dimension into a datawarehouse I mean loading the information from a source system (OLTP) into a dimension table that is part of a star model. There are several approaches to doing this, and I will discuss the different ways you can use to load a dimension.

I think there are two general methods of ETLing a datawarehouse. Some guys from SQLBI wrote articles on how they perform their loading method (SQLBI articles). They pointed me to an (obvious) loading method for small-sized systems. So, as I said earlier, there are two general loading methods for a dimension:

• ‘Loading the datawarehouse on the fly’. You can discuss whether this fits the definition of a datawarehouse (non-volatile? blabla), but this approach can be used when the amount of data is small, or at least loadable during the night. I have heard that some Inmon-minded datawarehouses generate star models on the fly from a (relational) datawarehouse, but I haven’t seen this working yet (meaning it could be out there, I just haven’t seen it).

• Keeping a record of earlier loaded information in your datawarehouse. This means that the process of loading the information into a dimension table keeps track of whether the information has been loaded earlier. If the record was loaded earlier, don’t process it further. If it was not loaded before, load the (new or changed) record into your dimension table. And, if it was deleted in the source, mark the specific record as deleted.

What do you want to store in your dimension?

• a new record in the source system (table, view, etc).

• a changed record in the source system (table, view, etc).

• a deleted record in the source system (table, view, etc).

The first two kinds of changes (new and changed) in the source are hardly debatable, but detecting a deleted record in the source is more difficult, because it requires a different approach than the first two. For the first two you simply compare the incoming records against the information already loaded in the datawarehouse, and then you decide what you want to do with the difference: update (Type 1) or insert (Type 2). Detection of a deleted record is something different.
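To make the compare step a bit more concrete, here is a minimal T-SQL sketch. The table and column names (stg.Customer as the incoming set, dim.Customer as the dimension table, CustomerCode as the business key) are just hypothetical examples, not a full SCD implementation:

-- Type 1: overwrite a changed attribute on records that were loaded earlier.
UPDATE d
SET d.City = s.City
FROM dim.Customer AS d
JOIN stg.Customer AS s ON s.CustomerCode = d.CustomerCode
WHERE d.City <> s.City;

-- New records: insert everything that has not been loaded before.
-- (For Type 2 you would expire the current row and insert a new version
-- instead of updating in place.)
INSERT INTO dim.Customer (CustomerCode, Name, City, IsDeleted)
SELECT s.CustomerCode, s.Name, s.City, 0
FROM stg.Customer AS s
WHERE NOT EXISTS (SELECT 1 FROM dim.Customer AS d
                  WHERE d.CustomerCode = s.CustomerCode);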

How do you detect a deleted record in the source system? There are two options I can think of:

• Using a trigger (or CDC, or something similar) which fires an event when a record is deleted in the source.

• Or detecting the deleted records by offering the complete dataset to the datawarehouse on every ETL run.

The first method could be called ‘pushing the deleted records into the datawarehouse’ and the other one ‘pulling the deleted records into the datawarehouse’. Both have their advantages and disadvantages; I won’t discuss them here.
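For the ‘pull’ variant, a small sketch with the same hypothetical stg.Customer and dim.Customer tables from above: when the complete source set is offered on every run, every business key that is no longer in the incoming set can be flagged as deleted in the dimension.

-- Mark dimension records whose business key no longer exists in the full source set.
UPDATE d
SET d.IsDeleted = 1,
    d.DeletedDate = GETDATE()
FROM dim.Customer AS d
WHERE d.IsDeleted = 0
  AND NOT EXISTS (SELECT 1 FROM stg.Customer AS s
                  WHERE s.CustomerCode = d.CustomerCode);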

So this is it for now. A blog shouldn’t be too lengthy… Next part later.

Friday, 25 December 2009

Shredding XML

For my current project I'm working with XML data. I have to read XML files into SQL Server, convert them from XML into SQL Server tables and put the data in star schemas. I want to discuss two methods of retrieving values from elements in an XML string: value() and nodes().
Consider the following XML string:
<Message>
 <ID>123</ID>
 <Persons>
  <Person id="1">
   <Name>Hennie</Name>
   <PhoneNo>34343</PhoneNo>
  </Person>
  <Person id="2">
   <Name>Marijn</Name>
   <PhoneNo>23432</PhoneNo>
  </Person>
 </Persons>
</Message>
In the example above you can see one complex type: Persons. As I said earlier, I am using two XML methods for retrieving values from an XML string: the value() and the nodes() function. They use XQuery. As you may know, XQuery is a powerful and convenient language designed for processing XML data. That means not only files in XML format, but also other data, including databases, whose structure (nested, named trees with attributes) is similar to XML.

The value() method retrieves a scalar value from an xml datatype instance and casts it to a SQL Server datatype. It is particularly useful when you need to retrieve a singleton atomic value from your xml instance, or for shredding your xml instances when used in conjunction with the nodes() method.

Below you can see an example of shredding a XML string:
DECLARE @xmlString xml
SET @xmlString ='
<Message>
 <ID>123</ID>
 <Persons>
  <Person id="1">
   <Name>Hennie</Name>
   <PhoneNo>34343</PhoneNo>
  </Person>
  <Person id="2">
   <Name>Marijn</Name>
   <PhoneNo>23432</PhoneNo>
  </Person>
 </Persons>
</Message>'

SELECT
@xmlString.value(N'(/Message/ID)[1]', N'nvarchar(10)') as MessageID,
Mytable.Mycol.value(N'(../../ID)[1]', N'nvarchar(10)') as AnotherMessageID,
Mytable.Mycol.value(N'@id[1]', N'nvarchar(10)') as Personid,
Mytable.Mycol.value(N'(./@id)[1]', N'nvarchar(10)') as AnotherPersonid,
Mytable.Mycol.value(N'(Name)[1]', N'nvarchar(10)') as Name
FROM @xmlString.nodes(N'/Message/Persons/Person') AS Mytable(Mycol);

This would give the following results (one row per Person element):

MessageID  AnotherMessageID  Personid  AnotherPersonid  Name
123        123               1         1                Hennie
123        123               2         2                Marijn

The first line in the SELECT is just a ‘normal’ scalar function:
@xmlString.value(N'(/Message/ID)[1]', N'nvarchar(10)') as MessageID,
It uses @xmlString and retrieves the first ID in the message (there is only one).

The rest of the lines are shredder expressions. They use the powerful shredding function nodes(). The nodes() function moves the context from one complex element to the next (in this case Person). The value() function plays a different role here (in combination with the Bulk_Column_Alias, which I will blog about later): as the context moves, it picks the desired fields from the Person complex type. I tried some different things to see how XQuery and XPath work:
Mytable.Mycol.value(N'(../../ID)[1]', N'nvarchar(10)') as AnotherMessageID
The above line jumps two elements up and shows the same ID as the first line.

The third line will show the value of the 'id' attribute of the Person element:
Mytable.Mycol.value(N'@id[1]', N'nvarchar(10)') as Personid

The fourth line is the same as the third line, but with a little difference: ./ which means 'current context'.
Mytable.Mycol.value(N'(./@id)[1]', N'nvarchar(10)') as AnotherPersonid

The fifth line, which I'm using the most, will retrieve the 'Name' between a starting element and an ending element:
Mytable.Mycol.value(N'(Name)[1]', N'nvarchar(10)') as Name
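In practice the XML is usually not in a variable but in a table column. The same shredding then works with CROSS APPLY. Below is a small sketch; the table dbo.XmlMessages and its column MessageBody are just hypothetical examples:

-- Hypothetical table holding the incoming XML messages.
CREATE TABLE dbo.XmlMessages
(
 MessageKey int IDENTITY(1,1) PRIMARY KEY,
 MessageBody xml NOT NULL
);

-- Shred every Person element of every stored message.
SELECT
 m.MessageKey,
 m.MessageBody.value(N'(/Message/ID)[1]', N'nvarchar(10)') as MessageID,
 p.Person.value(N'@id[1]', N'nvarchar(10)') as Personid,
 p.Person.value(N'(Name)[1]', N'nvarchar(50)') as Name
FROM dbo.XmlMessages AS m
CROSS APPLY m.MessageBody.nodes(N'/Message/Persons/Person') AS p(Person);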

That's all folks!
Hennie

Thursday, 17 December 2009

CBIP certified Today!!!

YES! Yup, I did it. Today I passed two exams of the CBIP certification: Datawarehousing and Datamanagement. A couple of weeks ago I passed the core examination. So here are the results:
  • Core examination: 68%
  • Datawarehousing: 70%
  • Datamanagement: 64%
I thought the datamanagement exam would be the easiest one, but I miscalculated that.... I had to read some questions three times before I understood them.... But I made it. I like doing this, and I think I will do some more exams next year....

I talked about the core examination a couple of weeks ago; see my tag list. The datawarehouse examination was about... of course, datawarehousing. The Data Warehousing exam covers the planning, design, and implementation of the data warehousing environment. It also includes data warehousing functions such as analysis and information delivery, the organizational skills required, and the roles and responsibilities of the data warehousing professional within the organization.

There were questions about:
  • The ODS classes.
  • Star schemas.
  • The balanced scorecard.
  • A lot of questions about metadata and data quality.
  • ETL, etc.

The datamanagement exam was the most difficult one for me, for some reason. I expected that I would slide smoothly through this exam... NOT. Some questions were difficult to understand. What is the datamanagement exam all about? All business intelligence applications depend on quality data analysis and design. Analysis concentrates on understanding business requirements for data and information. Design focuses on translating business information into data structures. Core skills include information needs analysis, specification of business metrics, and data modeling. A solid understanding of data warehousing concepts, architectures, and processes is also essential.

Below are some topics I noticed on the exam:
  • What is the essence of data security?
  • The definition of metadata.
  • Objectives of quality improvement.
  • What are typical names for entities and attributes?
  • What is a typical entity type? Strange question.
  • EDI.


So I'm celebrating this with a lot of joy...

CU

Greetz,
Hennie

Wednesday, 16 December 2009

Varchar(MAX)

Today I had a discussion with some guys from the test team. The discussion was about how they were going to test some of the import fields and, specifically, the length of a field. Our ETL extracts XML messages into dimensions and facts. Some of the fields (elements) in an XML message don't have a length restriction, so how many characters should some varchar fields hold? Hmmmmm. Why shouldn't we make all the varchar fields varchar(MAX)? Why not?!! But first, let's do some investigation with my friend Google. This is what I found out.

Microsoft introduced the VARCHAR(MAX), NVARCHAR(MAX), and VARBINARY(MAX) data types in SQL Server 2005. These data types can hold the same amount of data a BLOB can hold (2 GB) and they are stored in the same type of data pages used for other data types. When data in a MAX data type exceeds 8 KB, an overflow page is used. SQL Server 2005/2008 automatically assigns an overflow indicator to the page and knows how to manipulate the data rows the same way it manipulates other data types.
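To make that in-row/off-row behaviour a bit more tangible, here is a small sketch (table and column names are made-up examples): a MAX value stays in-row as long as the row fits on the 8 KB page, and with sp_tableoption you can force the MAX values off-row from the start.

-- Example table (hypothetical names).
CREATE TABLE dbo.ImportedElement
(
 ElementKey int IDENTITY(1,1) PRIMARY KEY,
 ElementValue varchar(max) NULL -- stays in-row while the row fits on the 8 KB page
);

-- Force the MAX values out of the data row from the start (only a small
-- pointer stays in-row), instead of waiting until the row overflows:
EXEC sp_tableoption 'dbo.ImportedElement', 'large value types out of row', 1;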

The following problems with varchar fields are identified on blogs, websites, etc.:
  • For data values less than 8000 bytes this is the common choice, as it avoids wasted space. However, if the application can change the size of the data after the initial creation of the row, then there is the possibility of fragmentation occurring through page splits. Answer: we only do insertions in the dimensions and the facts.

  • With the added benefit that the value can grow larger than 8000 bytes. In that case it will be pushed off-row automatically and start incurring extra I/O for each access. Answer: okay, this could happen, but if the fields are smaller than 8000 bytes (and mostly they are) there will be no problem. So no problem!

  • An additional benefit of storing the data off-row is that it can be placed in a separate filegroup, possibly on less expensive storage (e.g. RAID 5), but then there's the drawback that it can't be moved after being created except with an export/import operation. This option has the same online-operations drawback as storing the data in-row. Answer: hmmm, not sure how you can store off-row data in another filegroup, but maybe in the future we could do this. Let's keep this in mind (see the sketch below).
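As far as I know, directing the off-row data to a separate filegroup is done at table creation time with TEXTIMAGE_ON. A hedged sketch (the filegroup LOB_FG is hypothetical and must already exist in the database); and indeed, it can't be moved afterwards without rebuilding the table:

-- In-row data on the default filegroup, LOB/MAX data on a separate one.
CREATE TABLE dbo.ImportedDocument
(
 DocumentKey int IDENTITY(1,1) PRIMARY KEY,
 DocumentBody varchar(max) NULL
)
ON [PRIMARY]
TEXTIMAGE_ON [LOB_FG];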
So this is it. I can't think of any disadvantages of storing strings in varchar(MAX) fields. So I'm really thinking of using this....

Greetz,
Hennie

Sunday, 13 December 2009

Datamining with SQL Server

Today I'm gonna tell you something about datamining with SQL Server. Datamining is part of Analysis Services. Data mining helps you make decisions about difficult business problems. It's possible to identify patterns in your data, so you can determine why things happen. It's even possible to predict what will happen in the future. You can create data mining models based on cubes, datawarehouses or even Excel files.

I used a tutorial from MSDN (http://msdn.microsoft.com/en-us/library/bb510523.aspx) for some pictures. I created a datamining project and used the decision tree algorithm. I created the following decision tree:

The tutorial is about the AdventureWorks database, and the datamining model is created to understand what kind of customers buy bicycles. The business value is being able to target new marketing campaigns when you know more about your customers; you don't want to send offers to the wrong customer. Other examples are: finding the common (root) causes of network problems, analyzing servicedesk incidents/changes, production planning problems. You can use it for any investigation of cause and effect between attributes.

In the above diagram you'll see that (potential) customers between the ages of 36 and 44 are highly(?) predictable bike buyers.

The decision tree algorithm has the following advantages:
  • Good visual presentation.
  • Rules are understandable.
  • Predictability.
  • It shows what is important.
This tool is not an end-user tool, but you can use the data mining toolbar in Office 2007 and 2010.