Thursday, December 31, 2009

How to ETL a dimension in a data warehouse (part I)

Today I want to discuss some different methods of extracting, transforming and loading (ETL) a dimension into a data warehouse. By "ETLing" a dimension I mean loading the information from a source system (OLTP) into a dimension table that is part of a star model. There are several approaches to doing this, and I will discuss the different ways you can use to load a dimension.

I think there are two general methods of ETLing a data warehouse. Some guys from SQLBI wrote articles on how they perform their loading (SQLBI articles). They pointed me to an (obvious) loading method for small-sized systems. So, as I said, there are two general loading methods for a dimension:

• ‘Loading the data warehouse on the fly’. You can discuss whether this fits the definition of a data warehouse (non-volatile? etc.), but this approach can be used when the amount of data is small, or at least loadable during the night. I have heard that some Inmon-minded data warehouses generate star models on the fly from a (relational) data warehouse, but I haven't seen this working yet (meaning it could exist, I just haven't seen it).

• Keeping a record of earlier loaded information in your data warehouse. This means that the process that loads the information into a dimension table keeps track of whether the information has been loaded before. If the record was loaded earlier, don't process it further. If it was not loaded before, load the (new or changed) record into your dimension table. And if it was deleted, mark the specific record as deleted.

What do you want to store in your dimension?

• a new record in the source system (table, view, etc.).

• a changed record in the source system (table, view, etc.).

• a deleted record in the source system (table, view, etc.).

The first two types of changes in the source (new and changed) are pretty much beyond discussion, but detecting a deleted record in the source is more difficult, because it requires a different approach. The first two options come down to comparing the incoming records against the information already loaded in the data warehouse, and then deciding what you want to do with the result: an update (Type I) or an insert (Type II). Detecting a deleted record, however, is something different.
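To make this a bit more concrete, here is a minimal T-SQL sketch of that compare-and-decide step. The table and column names (stg.Customer, dim.Customer, CustomerBK, Name, City) are made up purely for illustration, and NULL handling is left out; it is a sketch, not the only way to do it:

-- New records: the business key is not present in the dimension yet, so insert it.
INSERT INTO dim.Customer (CustomerBK, Name, City)
SELECT s.CustomerBK, s.Name, s.City
FROM stg.Customer s
WHERE NOT EXISTS (SELECT 1 FROM dim.Customer d WHERE d.CustomerBK = s.CustomerBK);

-- Changed records: the business key exists but a tracked attribute differs,
-- so overwrite the old values (Type I). A Type II approach would insert a new
-- version of the row instead of updating it.
UPDATE d
SET d.Name = s.Name,
    d.City = s.City
FROM dim.Customer d
INNER JOIN stg.Customer s ON s.CustomerBK = d.CustomerBK
WHERE d.Name <> s.Name OR d.City <> s.City;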

How do you detect a deleted record in the source system? There are two options I can think of:

• Using a trigger (CDC or the like) which fires an event when the record is deleted in the source.

• Or detecting the deleted records by offering the complete dataset to the data warehouse on every ETL run.

The first method could be called ‘pushing the deleted records into the data warehouse’ and the other one ‘pulling the deleted records into the data warehouse’. Both have their advantages and disadvantages; I won't discuss those here.
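For the ‘pull’ variant, a hedged sketch of the deleted-record detection could look like this (same made-up table names as above; the dimension row is only flagged, never physically removed):

-- Dimension rows whose business key no longer appears in the complete extract
-- are assumed to be deleted in the source system.
UPDATE d
SET d.IsDeleted = 1,
    d.DeletedDate = GETDATE()
FROM dim.Customer d
WHERE NOT EXISTS (SELECT 1 FROM stg.Customer s WHERE s.CustomerBK = d.CustomerBK)
  AND d.IsDeleted = 0;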

So this is it for now. A blog post shouldn't be too lengthy… The next part will follow later.

Friday, December 25, 2009

Shredding XML

For my current project I'm working with XML data. I have to read XML files into SQL Server, convert them from XML into SQL Server tables and put the data in star schemas. I want to discuss two methods of retrieving values from elements in an XML string: value() and nodes().
Consider the following XML string:
<Message>
 <ID>123</ID>
 <Persons>
  <Person id="1">
   <Name>Hennie</Name>
   <PhoneNo>34343</PhoneNo>
  </Person>
  <Person id="2">
   <Name>Marijn</Name>
   <PhoneNo>23432</PhoneNo>
  </Person>
 </Persons>
</Message>
In the example above you can see one complex type: Persons. As I said earlier, I am using two XML methods for retrieving values from an XML string: the value() and the nodes() functions. They use XQuery. As you may know, XQuery is a powerful and convenient language designed for processing XML data. That means not only files in XML format, but also other data, including databases, whose structure (nested, named trees with attributes) is similar to XML.

The value() function retrieves a scalar value from an xml data type instance and casts it to a SQL Server data type. The value() method is particularly useful when you need to retrieve a singleton atomic value from your xml instance, or for shredding your xml instances when used in conjunction with the nodes() method.

Below you can see an example of shredding an XML string:
DECLARE @index int
DECLARE @xmlString xml
SET @xmlString ='
<Message>
 <ID>123</ID>
 <Persons>
  <Person id="1">
   <Name>Hennie</Name>
   <PhoneNo>34343</PhoneNo>
  </Person>
  <Person id="2">
   <Name>Marijn</Name>
   <PhoneNo>23432</PhoneNo>
  </Person>
 </Persons>
</Message>'

SELECT
    @xmlString.value(N'(/Message/ID)[1]', N'nvarchar(10)') as MessageID,
    Mytable.Mycol.value(N'(../../ID)[1]', N'nvarchar(10)') as AnotherMessageID,
    Mytable.Mycol.value(N'@id[1]', N'nvarchar(10)') as Personid,
    Mytable.Mycol.value(N'(./@id)[1]', N'nvarchar(10)') as AnotherPersonid,
    Mytable.Mycol.value(N'(Name)[1]', N'nvarchar(10)') as Name
FROM @xmlString.nodes(N'/Message/Persons/Person') AS Mytable(Mycol);

This would give the following results (two rows, one per Person element):

MessageID   AnotherMessageID   Personid   AnotherPersonid   Name
123         123                1          1                 Hennie
123         123                2          2                 Marijn

The first line in the SELECT is just a ‘normal’ scalar function:
@xmlString.value(N'(/Message/ID)[1]', N'nvarchar(10)') as MessageID,
It uses @xmlString and tries to find the first ID in the message (there is only one).

The rest of the lines are shredding expressions. They use the powerful nodes() function, which moves the context from one complex element to the next (in this case from Person to Person). The value() function plays a different role here (in combination with the Bulk_Column_Alias; I will blog about this later): as the context moves, it picks the desired fields from the Person complex type. I tried a few different things to see how XQuery and XPath work:
Mytable.Mycol.value(N'(../../ID)[1]', N'nvarchar(10)') as AnotherMessageID
The above line jumps two elements up and shows the same ID as the first line.

The third line will show the value of the 'id' attribute of the Person element:
Mytable.Mycol.value(N'@id[1]', N'nvarchar(10)') as Personid

The fourth line is the same as the third, with one small difference: the ./ prefix, which means 'current node':
Mytable.Mycol.value(N'(./@id)[1]', N'nvarchar(10)') as AnotherPersonid

The fifth line, which I use the most, retrieves the 'Name' value between a starting element and an ending element:
Mytable.Mycol.value(N'(Name)[1]', N'nvarchar(10)') as Name

That's all folks!
Hennie

Thursday, December 17, 2009

CBIP certified Today!!!

YES, I did it. Today I passed two exams of the CBIP certification: Data Warehousing and Data Management. A couple of weeks ago I passed the core examination. So here are the results:
  • Core examination: 68%
  • Data Warehousing: 70%
  • Data Management: 64%
I thought the Data Management exam would be the easiest one, but I miscalculated that... I had to read some questions three times before I understood them... But I made it. I enjoyed doing it and I think I will do some more exams next year...

I talked about the core examination a couple of weeks ago; see my tag list. The Data Warehousing examination was about... of course, data warehousing. The Data Warehousing exam covers the planning, design, and implementation of the data warehousing environment. It also includes data warehousing functions such as analysis and information delivery, the organizational skills required, and the roles and responsibilities of the data warehousing professional within the organization.

There were questions about:
  • The ODS classes.
  • Star schemas.
  • The balanced scorecard.
  • A lot of questions about metadata and data quality.
  • ETL, etc.

The Data Management exam was the most difficult one for me, for some reason. I expected that I would slide smoothly through this exam... NOT. Some questions were difficult to understand. What is the Data Management exam all about? All business intelligence applications depend on quality data analysis and design. Analysis concentrates on understanding business requirements for data and information. Design focuses on translating business information requirements into data structures. Core skills include information needs analysis, specification of business metrics, and data modeling. A solid understanding of data warehousing concepts, architectures, and processes is also essential.

Below are some topics I noticed on the exam:
  • What is the essence of data security?
  • The definition of metadata.
  • Objectives of quality improvement.
  • Typical names of entities and attributes.
  • What is a typical entity type? Strange question.
  • EDI.


So I'm celebrating this with a lot of joy...

CU

Greetz,
Hennie

Wednesday, December 16, 2009

Varchar(MAX)

Today I had a discussion with some guys from the test team. The discussion was about how they were going to test some of the import fields and, specifically, the length of a field. Our ETL extracts XML messages into dimensions and facts. Some of the fields (elements) in an XML message don't have a length restriction, so how many characters should some of the varchar fields be? Hmmmmm. Why shouldn't we make all the varchar fields varchar(MAX)? Why not?! But first, let's do some investigation with my friend Google. This is what I found out.

Microsoft introduced the VARCHAR(MAX), NVARCHAR(MAX), and VARBINARY(MAX) data types in SQL Server 2005. These data types can hold the same amount of data BLOBs can hold (2 GB), and they are stored in the same type of data pages used for other data types. When data in a MAX data type exceeds 8 KB, an overflow page is used. SQL Server 2005/2008 automatically assigns an overflow indicator to the page and knows how to manipulate data rows the same way it manipulates other data types.
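Just to make that concrete, a minimal sketch of such a column (the table and column names are invented here for illustration):

-- Hypothetical table: Description can grow up to 2 GB, but stays in-row
-- as long as the whole row still fits on an 8 KB data page.
CREATE TABLE dbo.ImportedMessage
(
    MessageID   int          NOT NULL PRIMARY KEY,
    Description varchar(MAX) NULL
);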

The following problems with varchar fields are identified on blogs, websites, etc.:
  • For data values of less than 8000 bytes, this is the common choice as it avoids wasted space. However, if the application can change the size of the data after the initial creation of the row, then there is the possibility of fragmentation occurring through page splits. Answer: we only do insertions in the dimensions and the facts.

  • With the added benefit that the value can grow larger than 8000 bytes. In that case it will be pushed off-row automatically and start incurring extra I/O for each access. Answer: okay, this could happen, but if the fields are smaller than 8000 bytes (and mostly they are) there will be no problem. So no problem!

  • An additional benefit of storing the data off-row is that it can be placed in a separate filegroup, possibly on less expensive storage (e.g. RAID 5), but then there's the drawback that it can't be moved after being created except with an export/import operation. This option has the same online-operations drawback as storing the data in-row. Answer: hmmm, I'm not sure how you can store off-row data in another filegroup, but maybe we could use this in the future. Let's keep this in mind (see the small sketch below).
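As far as I know you can at least force the MAX values off-row with sp_tableoption (assuming the same hypothetical table as above), although that alone still does not place them in a separate filegroup:

-- Store all varchar(MAX) values of this table on LOB pages instead of in-row,
-- regardless of their size.
EXEC sp_tableoption 'dbo.ImportedMessage', 'large value types out of row', 1;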
So this is it. I can't think of any real disadvantages of storing strings in varchar(MAX) fields, so I'm really thinking of using this...

Greetz,
Hennie

Sunday, December 13, 2009

Data mining with SQL Server

Today I'm going to tell you something about data mining with SQL Server. Data mining is part of Analysis Services. It helps you make decisions about difficult business problems: it's possible to identify patterns in your data, so you can determine why things happen, and it's even possible to predict what will happen in the future. You can create data mining models based on cubes, data warehouses or even Excel files.

I used a tutorial from MSDN (http://msdn.microsoft.com/en-us/library/bb510523.aspx) for some of the pictures. I created a data mining project and used the decision tree algorithm. I built the following decision tree:

The tutorial is about the AdventureWorks database, and the data mining model is created to understand what kind of customers buy bicycles. The business value is that you can target new marketing campaigns once you know more about your customers; you don't want to send offers to the wrong customer. Other examples are: finding common (root) causes of network problems, analyzing service desk incidents/changes, and production planning problems. You can use it for any investigation into cause and effect between attributes.

In the diagram above you can see that (potential) customers between 36 and 44 years old are highly(?) predictable bike buyers.

The decision tree algorithm has the following advantages:
  • Good visual presentation.
  • Rules are understandable.
  • Predictability.
  • It shows what is important.
This tool is not an end-user tool, but you can use the data mining toolbar in Office 2007 and 2010.

Thursday, December 10, 2009

Prevent Saving Changes

I decided that, from now on, I'll write my blog in English instead of Dutch. So my world expands from the Netherlands to the whole world ;-)

Today I got a strange error when I tried to change a table in SQL Server 2008.

Message:

Saving changes is not permitted. The changes you have made require the following tables to be dropped and re-created. You have either made changes to a table that can't be re-created or enabled the option Prevent saving changes that require the table to be re-created.



I hadn't run into this error before. When I tried to change a table by adding a primary key and saving it, this warning occurred. The reason is that saving a table change in SQL Server Management Studio requires the table to be dropped and recreated behind the scenes (apparently always(?)).

This is by design and can be quickly fixed in Management Studio by unchecking a property. To do this, go to Tools -> Options, then go to the Designers page and uncheck "Prevent saving changes that require table re-creation".
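By the way, for this particular change you can also skip the designer completely and add the primary key with a plain T-SQL statement, which does not drop and recreate the table. A small sketch with a hypothetical table and column name:

-- The column must already be NOT NULL for this to succeed.
ALTER TABLE dbo.MyTable
ADD CONSTRAINT PK_MyTable PRIMARY KEY CLUSTERED (MyTableID);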


Why do I have to go to the options menu? Why can't I just choose to accept the (possible) time it takes to recreate the table?

Tuesday, December 8, 2009

The future of SQL Server?

The following link http://microsoftpdc.com/Sessions/SVC27 gives a glimpse into the future of SQL Server. Project Houston in particular shows a fully cloud-based environment with SQL Server. Very interesting. Before long we'll be doing everything through the browser.

Greetz

Sunday, December 6, 2009

BIDS hangs? Just run the Import and Export Settings Wizard!

I installed SQL Server 2008 Developer Edition on Windows Server 2003 on my laptop (in a virtualized environment) and it works without any problem. At my current assignment SQL Server 2008 Enterprise Edition is installed on Windows Server 2003, and there I used to get hangs. A bit of googling turned up a tip: run through the 'Import and Export Settings' wizard:




And it works. I haven't had a hang since.

The steps to go through:
  1. Start BIDS.
  2. Open any package.
  3. Choose Tools.
  4. Choose Import and Export Settings.
  5. Choose "Reset all settings".
  6. Next, etc.
Strange but true.

Greetz

Friday, December 4, 2009

For some time now I have been using indirect variables in SSIS. I was getting warnings that SSIS was "attempting to configure from variable...etc.". Because it was only a warning, I thought it just meant that SSIS was trying to access the variable (and I assumed it succeeded).

Now I'm working with DTEXEC and xp_cmdshell, and this time I did get a much clearer error message saying that the environment variable could not be found!

The configuration environment variable was not found. The environment variable was: .........This occurs when a package specifies an environment variable for a configuration setting but it cannot be found. Check the configurations collection in the package and verify that the specified environment variable is available and valid

Heh?!

A bit of googling revealed that the SQL Server service has to be restarted. Problem solved.
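For reference, this is roughly how the package gets started in my case; the path and package name below are just placeholders. Because the process started by xp_cmdshell inherits the environment of the SQL Server service, a newly created environment variable only becomes visible after that service has been restarted.

-- Hypothetical path and package name.
EXEC master..xp_cmdshell 'dtexec /FILE "C:\SSIS\LoadDimensions.dtsx"';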

Greetz

Thursday, December 3, 2009

sysssislog

Today I played around a bit with SSIS logging. I wasn't sure whether I would build the logging myself or use the standard SSIS logging. First I enabled the logging in the packages. An enormous amount of clicking; thank you, MS. After that I did a few runs to see what ended up in the logging table. I can tell you: it is an enormous amount (if you tick all the events). I started analyzing the data and asked myself how I would want to present the information:

Package 1   start time   end time   duration
  task 1    start time   end time   duration
  task 2    start time   end time   duration
Package 2   start time   end time   duration
  task 1    start time   end time   duration
  task 2    start time   end time   duration

Well, for this I created the following query:





SELECT
    PKG.executionid AS executionid,
    PKG.source AS Package,
    PKG.PackageStartTime AS PackageStartTime,
    PKG.PackageEndTime AS PackageEndTime,
    PKG.PackageDurationInSec AS PackageDurationInSec,
    TSK.source AS Task,
    TaskStartTime AS TaskStartTime,
    TaskEndTime AS TaskEndTime,
    DATEDIFF(ss, TSK.TaskStartTime, TSK.TaskEndTime) AS TaskDurationInSec
FROM
    (SELECT
         executionid,
         source,
         MIN(starttime) AS PackageStartTime,
         MAX(endtime) AS PackageEndTime,
         DATEDIFF(ss, MIN(starttime), MAX(endtime)) AS PackageDurationInSec
     FROM sysssislog
     WHERE event IN ('PackageStart', 'PackageEnd') -- Events of the package itself.
     GROUP BY executionid, source) PKG
    INNER JOIN
    (SELECT
         MAX(id) AS id,
         executionid,
         source,
         MIN(starttime) AS TaskStartTime,
         MAX(endtime) AS TaskEndTime
     FROM sysssislog
     WHERE event IN ('OnPreExecute', 'OnPostExecute') -- Events of a task.
       AND source NOT IN -- Exclude the package itself (apparently it is logged as a task too).
           (SELECT source
            FROM sysssislog
            WHERE event IN ('PackageStart', 'PackageEnd'))
     GROUP BY executionid, source) TSK
        ON PKG.executionid = TSK.executionid
       AND PKG.PackageStartTime <= TSK.TaskStartTime
       AND PKG.PackageEndTime >= TSK.TaskEndTime
ORDER BY id DESC

All in all, I'm pretty happy with the result. It's not completely perfect because the join is based on time, but that works out fine because it also joins on executionid. In the coming period I'll build a small report on top of this and test it a bit further.

Greetz

Unable to prepare the SSIS bulk insert for data insertion.

I got an error while working with the SQL Server destination in SQL Server Integration Services (SQL Server 2008 SP1).

“Unable to prepare the SSIS bulk insert for data insertion. “.

The strange thing is that the package always worked, but after I enabled logging it started throwing this error. I am importing a CSV file into a SQL Server destination.

Somewhere on the internet I found that I should replace the SQL Server destination with an OLE DB destination. So... that's what I did, and it seems to work, but it remains a vague message. Why this happens now is a mystery.

Some things I found on the internet:

“SQL Server Destination only works if the server is on the same box as you are running SSIS.”
“The error message probably happened because the destination SQL Server database was located on a remote server. If this was the case, try using OLE DB destination instead. “

That is the case for me; I'm working on a single server.

I'll wait and see.