Friday, October 29, 2010

Quipu: A datavault generator

Today a post about Quipu, an open source generator tool. It generates a Data Vault data warehouse model, including a staging environment, based on a source data model. This source data model is reverse engineered and stored as a source model in a datastore in the Quipu repository. From this source model it's possible to generate a staging model in the datastore (DDL); the load code is generated as well. From the staging tables the Data Vault model is generated using the staging data model. It's possible to improve the staging model by identifying business keys, marking tables as reference tables or defining additional relations between tables. From this point it's possible to generate the Data Vault model. Below I'll show you an example based on the AdventureWorksLT2008R2 database.
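To give an idea of what such a generated structure looks like, below is a minimal T-SQL sketch of a hub, a satellite and a link for the AdventureWorksLT Customer and SalesOrderHeader tables. Note that these table and column names are my own illustration of the Data Vault pattern, not actual Quipu output.

-- Minimal Data Vault sketch (my own illustration, not actual Quipu output).
-- A hub holds the business key, a satellite holds the descriptive
-- attributes over time, and a link connects two (or more) hubs.
CREATE TABLE dbo.H_Customer (
    CustomerSqn   int IDENTITY(1,1) NOT NULL PRIMARY KEY, -- surrogate key
    CustomerID    int         NOT NULL,                   -- business key
    LoadDts       datetime    NOT NULL,                   -- load timestamp
    RecordSource  varchar(50) NOT NULL                    -- e.g. 'AdventureWorksLT'
);

CREATE TABLE dbo.S_Customer (
    CustomerSqn   int          NOT NULL REFERENCES dbo.H_Customer (CustomerSqn),
    LoadDts       datetime     NOT NULL,
    FirstName     nvarchar(50) NULL,   -- descriptive attributes
    LastName      nvarchar(50) NULL,
    RecordSource  varchar(50)  NOT NULL,
    PRIMARY KEY (CustomerSqn, LoadDts) -- keeps history per business key
);

CREATE TABLE dbo.H_SalesOrder (
    SalesOrderSqn int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    SalesOrderID  int         NOT NULL,
    LoadDts       datetime    NOT NULL,
    RecordSource  varchar(50) NOT NULL
);

CREATE TABLE dbo.L_Customer_SalesOrder (
    LinkSqn       int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    CustomerSqn   int         NOT NULL REFERENCES dbo.H_Customer (CustomerSqn),
    SalesOrderSqn int         NOT NULL REFERENCES dbo.H_SalesOrder (SalesOrderSqn),
    LoadDts       datetime    NOT NULL,
    RecordSource  varchar(50) NOT NULL
);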

Installation
The installation of Quipu is easy. I've installed the following components:
  • The Quipu back-end.
  • The Quipu front-end.
  • Java 6 (version 1.6) runtime environment.
For testing Quipu I decided to use the AdventureWorksLT2008R2 database. This is a small subset of the AdventureWorks2008R2 database and its complexity is comparable with the Northwind database.

Start the start_quipu.bat file in the installation folder (not sure why this is needed, but I think it starts a web server).

Log in to Quipu
Okay, let's go.... Not. I had some problems with the login screen of Quipu. At first I got the following window when I tried to log into Quipu:


After trying to connect to localhost I got the following error:


Network error IOException? That doesn't look good. I found out that the TCP/IP setting in the SQL Server Configuration Manager was not properly enabled. So I enabled that! Then the following error occurred:



It took me a while to understand this error. The server authentication setting of SQL Server was 'Windows only' and Quipu doesn't allow Windows authentication. So I changed this setting to Windows and SQL authentication and I was in....

Create a connection
A connection needs to be created to the server where the database resides that will be used for creating a Data Vault model.

Create a datastore
The next step is creating a datastore in Quipu. A datastore is a subset of the repository on a specific connection. You can create a datastore for each user, for a certain work project, or separate datastores for source, staging and Data Vault schemas. The definition of a datastore is up to you, but you should have at least one datastore for each connection you want to use.



Import schemas
The next step in creating a Data Vault is making the structure of your source data known to the Quipu repository. Create a connection to the AdventureWorksLT2008R2 source database and Quipu extracts the metadata directly from the source schema.
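I don't know exactly how Quipu reads the source model, but on SQL Server this kind of reverse engineering comes down to querying the catalog views, along these lines (a sketch, not Quipu's actual query):

-- Read table and column metadata from the source database,
-- roughly the information a reverse-engineering step needs.
SELECT  c.TABLE_SCHEMA,
        c.TABLE_NAME,
        c.COLUMN_NAME,
        c.DATA_TYPE,
        c.CHARACTER_MAXIMUM_LENGTH,
        c.IS_NULLABLE
FROM    INFORMATION_SCHEMA.COLUMNS AS c
        JOIN INFORMATION_SCHEMA.TABLES AS t
          ON  t.TABLE_SCHEMA = c.TABLE_SCHEMA
          AND t.TABLE_NAME   = c.TABLE_NAME
WHERE   t.TABLE_TYPE = 'BASE TABLE'
ORDER BY c.TABLE_SCHEMA, c.TABLE_NAME, c.ORDINAL_POSITION;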


Create a staging environment
A staging area is a copy of your source with some additional metadata. The repository will be used for creating DDL and load scripts for a database.


Generate the Data Vault
After loading a staging schema and adjusting all schema properties accordingly, there are several extra options you can choose to manipulate the structure of the Data Vault you want to generate.



There are some options which I'll blog about later on.




Generate the code
After setting all the options, choose a target datastore for the Data Vault, select a name (an existing name overwrites the existing Data Vault schema) and push generate!


A logging screen appears which shows the decisions Quipu has made about whether a source table becomes a hub, a satellite or a link in the Data Vault schema.


Finally a SQL script is generated. The only thing I had to adjust was including the user-defined types in the script, and it was all set and go!
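The user-defined types in AdventureWorksLT are alias types, so re-creating them in the target database is a matter of a few CREATE TYPE statements. A sketch (verify the exact definitions against sys.types in the source database):

-- Re-create the AdventureWorksLT alias types in the target database
-- before running the generated script (check the exact definitions
-- in the source database first).
CREATE TYPE dbo.Name  FROM nvarchar(50) NULL;
CREATE TYPE dbo.Phone FROM nvarchar(25) NULL;
CREATE TYPE dbo.Flag  FROM bit NOT NULL;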


The result of the script is shown below:


Conclusion
I have to admit that Quipu is a great tool for generating SQL code that could be implemented in data warehouse projects. In the short time I spent with the tool it didn't crash and I couldn't find a bug, except for the model viewer, but that is a beta version, so I assume they will improve it in a future version.

I'll blog later about the generated Data Vault schema.

Greetz,
Hennie

Wednesday, October 27, 2010

SSAS: Dimension, hierarchies, levels, members, children, parents, etc

Currently I'm studying for my MCITP SQL BI Developer 2008 certification and I thought I should dive into the mysteries of MDX. At first glance the terminology sounds fuzzy: dimensions, members, hierarchies, levels, children, descendants, etc. So what are those terms?

A dimension is a familiar term: it is information about an object. A dimension can have multiple hierarchies, and hierarchies consist of different levels. Hierarchies can have members and levels can have members. Members have children, children have parents.... Okay... Let's try to visualize this. The diagram below is my interpretation of the different terminology of MDX.


• Hierarchy: Dimensions are defined as structural attributes of a cube, made up of levels arranged in hierarchies.
• Member: Members are the occurrences of a dimension.
• Level: A level is a set of members of a dimension, organized such that all members of the set are at an equal distance from the root of the hierarchy.
• First child/last child: The first or last member one level down in the hierarchy (.FirstChild / .LastChild).
• Children: One level down in the hierarchy.
• Parent: The member one level up in the hierarchy.
• Grandparent: Two levels up in the hierarchy (2x .Parent).
• Great-grandparent: Three levels up in the hierarchy (3x .Parent).
• Ancestor: Like parent, but for an arbitrary number of levels up the hierarchy (you specify the level or the distance).
So these are some explanations of the MDX terminology.

Hennie

Tuesday, October 26, 2010

SQL Zaterdag, October 23rd

The Dutch version of SQLSaturday was held on October 23rd and had a BI track and a DBA track. I joined the BI track, which consisted of four sessions:
  • Chris Webb about DAX
  • André Kamman about automating SSIS
  • Martijn van Amstel about Datavault
  • Vincent Wiering about SSAS best practice

Chris Webb about DAX
Chris did a wonderful job telling about DAX and how it's used in PowerPivot. He explained there are basically two sorts of calculations: column-based and measure-based. The last one is the most interesting. He also explained that DAX is likely to play an important role in other corporate BI products from Microsoft in the future, so it's something that all BI consultants need to learn now. Chris had the following conclusions:
    • DAX does the easy stuff very easily;
    • DAX does the medium stuff well too;
    • DAX does not do the hard stuff well at all

André Kamman about automating SSIS
SSIS is one of my favorite tools and I use it regularly, so I was interested in what André had to say about automating SSIS. I've looked into this stuff myself and built some dynamic SSIS packages. André showed some stuff about C# and SSIS. He told something about the pitfalls and debugging problems. He was able to create SSIS packages with derived columns, multicasts, adapters, etc. It was not an easy job, but he was able to manage an import of 100 tables with this tool. An interesting session, and it made me think about dynamically generating SSIS packages.



Martijn van Amstel about Data Vault
Martijn van Amstel explained the Data Vault methodology plainly and simply. He explained the differences between Data Vault and Kimball, why we need Data Vault and when it could be an aid in (enterprise) data warehouses. He also explained some problems he ran into in a project with Data Vault. It was an introduction to Data Vault, but some things were interesting.





Vincent Wiering on SSAS best practices
Vincent explained an example from a project at a customer he visited a couple of weeks ago. He explained how he handled the project and how he developed a solution with Excel and SSAS.


Conclusion
So this was an interesting Saturday with some instructive sessions about BI and SQL Server, and it's advisable to join the next sessions of SQLSaturday.


    Gr,
    Hennie

Thursday, October 21, 2010

Kimball vs Inmon (Part II): it's now scientific.

In an earlier post I've written about Inmon vs Kimball, according to Inmon himself. The conclusions in my post are supported by scientific research by Watson & Ariyachandra (2005). In their research document "Types of DW architectures and factors influencing their selection" they try to answer the following questions:
1. What factors lead companies to select a particular architecture, and
2. How successful are the various architectures?
In this well-written and easy-to-read document five architectures are identified after interviews with experts: independent data marts, bus architecture (Kimball), hub and spoke (Inmon), centralized and federated. Data Vault is a variation of the hub and spoke model of Inmon and isn't discussed in the document. I refer to the document for a more detailed explanation of these various data warehouse architectures. After the researchers had identified the architectures, about 500 managers, users, etc. of data warehouses were surveyed, so this is quite a thorough research project.

Below I've listed the most interesting parts of this document:
• 39% of the respondents use the hub and spoke model (Inmon).
• 26% of the respondents have built and maintain the bus architecture (Kimball).
• The hub and spoke architecture requires the most time to roll out, with an average of 11 months.
• The bus architecture takes about 9 months to develop and deploy.
• The hub and spoke architecture and the bus architecture are equally successful.
• One third of the companies switched architecture.
• 30% of the switchers switched from the hub and spoke to the bus architecture (AND back!). This is awkward! The supporters of the different architectures talked about the failures of the other(!). Perhaps they initially made the wrong choice of architecture, based on the wrong assumptions?
• The hub and spoke model has more technical issues and needed more expert influence.
• There is not a big difference between the hub and spoke model and the centralized data warehouse.

Conclusion
When to choose the bus architecture (Kimball):
• When the need for a data warehouse is high (and you want quick results).
• When there is a high need of information flow between organizational units. I don't understand this. Perhaps it's meant as the unstructured version of the data integration of the hub and spoke architecture?
• When you want to build a siloed system / departmental data warehouse (from my earlier post).
• When the information requests don't change that much (from my earlier post).
• When the view of the warehouse prior to implementation is more limited in scope.

When to choose the hub and spoke architecture (Inmon):
• When there is a high need of integration between organizational units.
• When the data warehouse is viewed as strategic.
• When the perceived ability of the in-house staff is high.
• When the information needs are not quite clear (from my earlier post).
• When the information requests are varied (from my earlier post).
• When the information requests are very ad hoc (from my earlier post).
• When the 'system of record' principle is important (from my earlier post).
• When there is (more) time to build the data warehouse.

That's it for now....

            Hennie

Tuesday, October 19, 2010

              SSIS Framework (Part I)

In one of my projects I'm building an SSIS framework as a base for housing all sorts of different SSIS packages. The purpose is to standardize all SSIS packages in order to get advantages like uniformity, monitoring, one central location for storing meta information, quick error solving, performance tuning, etc.

We have established a framework with all kinds of features:
• Automatic logging of package information in a SQL Server services database.
• Automatic logging of package information in a file.
• The use of checkpoints in a package.
• The logging of errors in a SQL Server services database.
• Putting buffer files in the project folder and not in the TEMP location.
• Counting the following measures: initial row count, final row count, extract count, insert count, extract error count and insert error count.
The diagram below shows the overall SSIS package. There are five steps: truncating the destination table, starting the audit process, the data flow task, ending the audit process, and deleting the empty error files and old archive files.
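To give an impression of how the audit steps talk to the logging database, here is a minimal sketch of an audit table with start and end procedures. The names are illustrative; this is not the actual framework code.

-- Minimal audit sketch (illustrative names, not the actual framework).
CREATE TABLE dbo.PackageAudit (
    AuditID           int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    PackageName       nvarchar(255) NOT NULL,
    StartTime         datetime      NOT NULL,
    EndTime           datetime      NULL,
    ExtractCount      int           NULL,
    InsertCount       int           NULL,
    ExtractErrorCount int           NULL,
    InsertErrorCount  int           NULL
);
GO

-- Called by the 'start audit' task; returns the audit key to the package.
CREATE PROCEDURE dbo.usp_StartAudit
    @PackageName nvarchar(255),
    @AuditID     int OUTPUT
AS
BEGIN
    INSERT INTO dbo.PackageAudit (PackageName, StartTime)
    VALUES (@PackageName, GETDATE());
    SET @AuditID = SCOPE_IDENTITY();
END;
GO

-- Called by the 'end audit' task with the counts from the data flow.
CREATE PROCEDURE dbo.usp_EndAudit
    @AuditID int, @ExtractCount int, @InsertCount int,
    @ExtractErrorCount int, @InsertErrorCount int
AS
BEGIN
    UPDATE dbo.PackageAudit
    SET EndTime           = GETDATE(),
        ExtractCount      = @ExtractCount,
        InsertCount       = @InsertCount,
        ExtractErrorCount = @ExtractErrorCount,
        InsertErrorCount  = @InsertErrorCount
    WHERE AuditID = @AuditID;
END;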

Next time I'll show the steps in more detail.

              Greetz,
              Hennie

              Again some simple MDX queries

              Hi,

              I gathered again some simple MDX queries which you can try and run on the AdventureWorks 2008 R2 database.

              select
              from
              [Adventure Works];
              GO

              select
              from
              [Adventure Works]
              where [Measures].[Reseller Sales Amount];
              GO

              select
              from
              [Adventure Works]
              where [Measures].[Internet Sales Amount];
              GO

              select
              [Measures].[Reseller Sales Amount]
              on columns
              from
              [Adventure Works];
              GO

              select
              [Date].[Calendar].[Calendar Year]
              on columns
              from
              [Adventure Works];
              GO

              select
              {[Date].[Calendar].[Calendar Year],[Date].[Calendar]}
              on columns
              from
              [Adventure Works];
              GO

              select
              {[Date].[Calendar].[Calendar Year],[Date].[Calendar]}
              on columns,
              [Product].[Product Categories].[Category]
              on rows
              from
              [Adventure Works];
              GO

              select
              {[Date].[Calendar].[Calendar Year],[Date].[Calendar]}
              on columns,
              {[Product].[Product Categories].[Category],[Product].[Product Categories]}
              on rows
              from
              [Adventure Works];
              GO

              select
              non empty {[Date].[Calendar].[Calendar Year],[Date].[Calendar]}
              on columns,
              {[Product].[Product Categories].[Category],[Product].[Product Categories]}
              on rows
              from
              [Adventure Works];
              GO

              select
              {[Date].[Calendar].[Calendar Year],[Date].[Calendar]}
              on columns,
              {[Product].[Product Categories].[Category],[Product].[Product Categories]}
              on rows
              from
              [Adventure Works]
              where [Measures].[Internet Sales Amount];
              GO

              select
              non empty {[Date].[Calendar].[Calendar Year],[Date].[Calendar]}
              on columns,
              non empty {[Product].[Product Categories].[Category],[Product].[Product Categories]}
              on rows
              from
              [Adventure Works]
              where [Measures].[Internet Sales Amount];
              GO

              Greetz,
              Hennie

Sunday, October 17, 2010

              SSIS : indirect configuration and one configuration per connection in a configuration database


In this post I'll discuss an example of a package configuration in Integration Services with an indirect configuration (environment variable) pointing to a configuration database. For every connection a reference is created in a configuration table in the configuration database, and for every connection a reference is created in the package configuration. Ray Barley blogged about this on MSSQLTIPS.

So when a connection is created, there are three steps: 1. create the connection, 2. create the package configuration and 3. create a reference in the configurations table.

The configuration table can be divided into different sections: master configurations, database connections and project configurations. In the master configurations section reside the master definitions, like a root folder for all of your SSIS projects. In the database connections section every connection is stored once, and in the project configurations section project variables can be stored.
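As an illustration, assuming the standard table layout that SSIS generates for a SQL Server package configuration, the table and its three sections could look like this (the filter names are my own convention, not a fixed standard):

-- Standard layout SSIS generates for SQL Server package configurations.
CREATE TABLE dbo.[SSIS Configurations] (
    ConfigurationFilter nvarchar(255) NOT NULL,
    ConfiguredValue     nvarchar(255) NULL,
    PackagePath         nvarchar(255) NOT NULL,
    ConfiguredValueType nvarchar(20)  NOT NULL
);

-- Example rows per section (filter names are my own convention).
INSERT INTO dbo.[SSIS Configurations]
    (ConfigurationFilter, ConfiguredValue, PackagePath, ConfiguredValueType)
VALUES
    -- Master configurations: definitions shared by all projects.
    (N'MASTER', N'D:\SSIS\',
     N'\Package.Variables[User::RootFolder].Properties[Value]', N'String'),
    -- Database connections: every connection stored once.
    (N'DB_AdventureWorks',
     N'Data Source=MYSERVER;Initial Catalog=AdventureWorks;Provider=SQLNCLI10.1;Integrated Security=SSPI;',
     N'\Package.Connections[AdventureWorks].Properties[ConnectionString]', N'String'),
    -- Project configurations: variables of one specific project.
    (N'PRJ_Sales', N'Sales',
     N'\Package.Variables[User::ProjectName].Properties[Value]', N'String');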


Advantages
• Flexible.
• Strong cohesion in the SSIS package and less with its environment.
• Maintainable.
• Secure.
Disadvantages
• You need a (master or per project) configuration database.
• Not easily portable. You need to create scripts to port from one environment to another.
              Greetz
              Hennie

Wednesday, October 13, 2010

              Information governance, BI Governance, data governance and MDM

For this blog I would like to share some ideas about governance. Currently, all kinds of types of governance are mentioned in the literature (and on the web). This post is inspired by this article by Rajan Chandras. He writes about the different types of governance and how to position them relative to each other. An interesting article, because governance and MDM are keywords which are popular these days.

Wikipedia has some good references about information governance, BI governance, data governance and MDM. These subjects seem to be related to each other and Rajan shows some insights into how they are connected.
In a former post of mine I already positioned MDM in relation to data governance (which differs from the article of Rajan), data quality, security, meta data management and a data architecture. In that diagram (from Informatica) data governance is based on MDM. BI governance wasn't even mentioned, and neither was information governance.


Rajan makes a distinction between data governance and BI governance. Data governance generally refers to data that comes in through our transactional software applications (structured data). BI governance is closely related to data governance, but it's not quite the same: it is more focused on who needs the information, how the information can best be provisioned, etc. If we integrate the two opinions we could have something like this:


              Just some thoughts of mine..

              Hennie

Monday, October 11, 2010

              Some MDX SELECT queries on AdventureWorks

For a while I have been collecting MDX sample queries from different sources like the internet, books, etc. for learning and implementing MDX queries. These are based on AdventureWorks 2008 R2. Sometimes I had to rewrite a query because it was based on the Foodmart cube. So when you're learning MDX, install the AdventureWorks 2008 R2 databases. Note that the Analysis Services database needs to be installed manually; don't forget this.

              Ok here are the sample queries:

              SELECT [Date].[Calendar Year].[CY 2005] on 0
              FROM [Adventure Works];

              SELECT [Date].[Calendar Year].FirstChild on 0
              FROM [Adventure Works];

              --Defaultmember
              SELECT [Date].[Calendar Year].defaultmember on 0
              FROM [Adventure Works];

--Item: picks one tuple from the set (zero-based index)
              SELECT {([Measures].[Reseller Sales Amount] ,[Date].[Calendar Year].[CY 2006])
              , ([Measures].[Reseller Sales Amount],[Date].[Calendar Year].[CY 2007])
              }.Item(1) ON COLUMNS
              FROM [Adventure Works];

              SELECT {([Measures].[Reseller Sales Amount] ,[Date].[Calendar Year].[CY 2006])
              , ([Measures].[Reseller Sales Amount],[Date].[Calendar Year].[CY 2007])
              } ON COLUMNS
              FROM [Adventure Works];

              SELECT {[Calendar Quarter].[Q1 CY 2005]:[Calendar Quarter].[Q4 CY 2007]}  ON 0
              FROM [Adventure Works];

              SELECT {
                 [Calendar Quarter].[Q1 CY 2006],
                 [Calendar Quarter].[Q2 CY 2006],
                 [Calendar Quarter].[Q3 CY 2006],
                 [Calendar Quarter].[Q4 CY 2006]
                 } ON 0
              FROM [Adventure Works];

              -- Named set
              WITH SET [EUROPE] AS '{[Customer].[Country].&[France],
              [Customer].[Country].&[Germany],
              [Customer].[Country].&[United Kingdom]}'
              SELECT Measures.[Internet Sales Amount] ON COLUMNS,
              [EUROPE] ON ROWS
              FROM [Adventure Works];

              --Named Set and Set operators
              WITH SET [EUROPE] AS '{[Customer].[Country].&[France]} + {[Customer].[Country].&[United Kingdom]}'
              SELECT Measures.[Internet Sales Amount] ON COLUMNS,
              [EUROPE] ON ROWS
              FROM [Adventure Works];

              --Named Set and Set operators
              WITH SET [EUROPE] AS '{[Customer].[Country].&[France]} + {[Customer].[Country].&[United Kingdom]}'
              SELECT Measures.[Internet Sales Amount] ON COLUMNS,
              [EUROPE] ON ROWS
              FROM [Adventure Works];

              --Named Set and Set operators (Cross Product)
              WITH SET [EUROPE] AS '{[Customer].[Country].&[France], [Customer].[Country].&[United Kingdom]} *
              {[Product].[Product Line].&[M],[Product].[Product Line].&[R]}'
              SELECT Measures.[Internet Sales Amount] ON COLUMNS,
              [EUROPE] ON ROWS
              FROM [Adventure Works];

              -- Calculated members
              WITH MEMBER Measures.[profit] AS [Measures].[Internet Sales Amount] - [Measures].[Internet Standard Product Cost]
              SELECT measures.profit ON COLUMNS,
              [Customer].[Country].MEMBERS ON ROWS
              FROM [Adventure Works];


              with member [Measures].[PCT Discount] AS '[Measures].[Discount Amount]/[Measures].[Reseller Sales Amount]', FORMAT_STRING = 'Percent'
              set Top10SellingProducts as 'topcount([Product].[Model Name].children, 10, [Measures].[Reseller Sales Amount])'
              //set Preferred10Products as '
              //{[Product].[Model Name].&[Mountain-200],
              //[Product].[Model Name].&[Road-250],
              //[Product].[Model Name].&[Mountain-100],
              //[Product].[Model Name].&[Road-650],
              //[Product].[Model Name].&[Touring-1000],
              //[Product].[Model Name].&[Road-550-W],
              //[Product].[Model Name].&[Road-350-W],
              //[Product].[Model Name].&[HL Mountain Frame],
              //[Product].[Model Name].&[Road-150],
              //[Product].[Model Name].&[Touring-3000]
              //}'
              select {[Measures].[Reseller Sales Amount], [Measures].[Discount Amount], [Measures].[PCT Discount]} on 0,
              Top10SellingProducts on 1
              from [Adventure Works];


              -- Combined named set and calculated set.
              -- order products based on the internet Sales Amount and returns the sales amount of each product along with the rank.
              WITH
              SET [Product Order] AS 'Order([Product].[Product Line].MEMBERS, [Internet Sales Amount], BDESC)'
              MEMBER [Measures].[Product Rank] AS 'RANK([Product].[Product Line].CURRENTMEMBER, [Product Order])'
              SELECT {[Product Rank], [Sales Amount]} ON COLUMNS,
              [Product Order] ON ROWS
              FROM [Adventure Works];

              --each half year is analyzed along with the cumulative sales for the whole year
              WITH MEMBER Measures.[Cumulative Sales] AS 'Sum(YTD(), [Internet Sales Amount])'
              SELECT {Measures.[Internet Sales Amount], Measures.[Cumulative Sales]} ON 0,
              [Date].[Calendar].[Calendar Semester].MEMBERS ON 1
              FROM [Adventure Works];

              --each month is analyzed along with the cumulative sales for the whole year
              WITH MEMBER Measures.[Cumulative Sales] AS 'Sum(YTD(), [Internet Sales Amount])'
              SELECT {Measures.[Internet Sales Amount], Measures.[Cumulative Sales]} ON 0,
              [Date].[Calendar].[Month].MEMBERS ON 1
              FROM [Adventure Works];

              --9,061,000.58
              SELECT [Measures].[Internet Sales Amount] ON COLUMNS
              FROM [Adventure Works]
              WHERE [Customer].[Country].[Australia];

              --$11,038,845.45
              SELECT [Measures].[Internet Sales Amount] ON COLUMNS
              FROM [Adventure Works]
              WHERE {[Customer].[Country].[Australia],
              [Customer].[Country].[Canada]};

              --$9,770,899.74
              SELECT [Measures].[Internet Sales Amount] ON COLUMNS
              FROM  [Adventure Works]
              WHERE [Date].[Calendar Year].&[2008];

              --
              SELECT
                  { [Measures].[Sales Amount], [Measures].[Tax Amount] } ON COLUMNS,
                  { [Date].[Fiscal].[Fiscal Year].&[2007], [Date].[Fiscal].[Fiscal Year].&[2008] } ON ROWS
              FROM [Adventure Works]
              WHERE ( [Sales Territory].[Southwest] );

              SELECT CROSSJOIN({[Date].[Calendar Year].ALLMEMBERS},{[Measures].[Internet Sales Amount]})
              FROM [Adventure Works];

-- Crossjoin, Allmembers (regular dimension?), Children (in case of a hierarchy?)
              SELECT 
              CROSSJOIN({[Date].[Calendar Year].ALLMEMBERS},{[Measures].[Internet Sales Amount]}) ON AXIS (0),
               {[Product].[Product Categories].CHILDREN} ON AXIS (1)
              FROM [Adventure Works];

              -- : = Range operator
              SELECT  CROSSJOIN({[Measures].ALLMEMBERS}, {[Date].[Calendar Year].[CY 2005]: [Date].[Calendar Year].[CY 2008]}) ON AXIS (0),
              CROSSJOIN({[Product].[Product Categories].CHILDREN},[Customer].[Country].[Country].MEMBERS) ON AXIS (1)
              FROM [Adventure Works];

              --MDX Functions
              WITH MEMBER measures.LocationName AS [Customer].[Country].CurrentMember.Name
              SELECT measures.LocationName ON COLUMNS,
              Customer.Country.members on ROWS
              FROM [Adventure Works];

              WITH MEMBER measures.User AS USERNAME
              SELECT measures.User ON 0
              FROM [Adventure Works];

              --SET Functions
              SELECT Measures.[Internet Sales Amount] ON COLUMNS,
              FILTER(CROSSJOIN({Product.[Product Line].[Product Line].MEMBERS},
              {[Customer].[Country].Members}), [Internet Sales Amount] >200000) ON ROWS
              FROM [Adventure Works];

              --Memberfunction (ParallelPeriod)
              SELECT ParallelPeriod ([Date].[Calendar].[Calendar Quarter]
              , 3
              , [Date].[Calendar].[Month].[October 2007])
              ON 0
              FROM [Adventure Works];
                
              SELECT ParallelPeriod ([Date].[Calendar].[Calendar Quarter]
              , 1
              , [Date].[Calendar].[Month].[November 2007])
              ON 0
              FROM [Adventure Works];

              Bye bye,
              Hennie

Friday, October 8, 2010

              MDM seminar with Informatica

On October 7th I joined a session about Master Data Management at Informatica. This area of data management is interesting because I think it will continue to emerge as businesses enable their data governance more and more. So what is Master Data Management all about, according to Philip Howard from Bloor Research? Well, master data management is one of the pillars of good data governance. Below is a diagram of the five layers beneath data governance.

What is data governance? Wikipedia: "Data governance is a set of processes that ensures that important data assets are formally managed throughout the enterprise. Data governance ensures that data can be trusted and that people can be made accountable for any adverse event that happens because of low data quality."

               

First you need to set up a good data architecture. When this is in place you need to develop meta data management, master data management, data quality and data security. When all of these layers are done and organized, you can set up a governance program.

Ravi Shankar, Senior Director Product Marketing at Informatica, showed the MDM approach of Informatica and pointed out that data stewards are very important because they are the link between business and IT: they can manage and monitor data quality. One thing I noticed is that the business case for MDM seems to be cross-sell and up-sell. These are main drivers for MDM, and this is logical, because large players in the market have siloed systems run by their divisions. Each of the divisions has its own systems. So MDM could lead to an integration of the divisions, and that will influence cross-sell and up-sell. I am wondering what the driver would be in the public health sector? Better management information, better decisions, no more duplicate patients, better research capabilities, better invoicing to insurance companies?

Ahmed Quadri, Senior Director MDM initiatives, showed the different products along the MDM cycle: discover, model, cleanse, recognize, resolve, relate, govern and deliver. Ravi Shankar showed some client cases afterwards.

After the seminar I was invited for lunch and I had a chance to exchange some thoughts with Ravi Shankar. I asked him when you would choose the registry, the repository or the hub approach. Well, he mentioned one example, a public health hospital in America, and said that for privacy and security reasons a registry implementation would be the best solution there.

              Greetz,
              Hennie

Wednesday, October 6, 2010

              SSIS : Six Scenarios and a best practice for the SSIS Package Configurations

              Introduction
I had a discussion with a colleague about the best way to make completely independent SSIS packages (or at least to try as much as we can). When you develop packages in a DTAP environment (Development, Test, Acceptance and Production), you would like to deploy them easily to the next environment. Sure, it is inevitable to encounter some dependencies on an environment, e.g. a root folder or an environment variable, but at least you can try to create SSIS packages that need as little maintenance as possible.

              First, I will depict all the available possibilities with package configurations. Then I will present a couple of scenarios explaining their advantages and disadvantages. Finally, I will demonstrate the best practice!

              Please note that this article is based on SQL Server 2008 R2.

              Package Configurations explained
I am not going to explain package configurations in detail because there is enough information on hand about this subject online. But I will explain the available options with package configurations and their plus and minus points.

              There are a couple of parameters involved when developing a DTAP proof SSIS environment. These parameters are:
              • Connection in the connection manager.
              • Type of package configuration (XML Configuration file, environment variable, registry entry, parent package variable and SQL Server).
              • The way the connections are set up (one connection per package configuration or one package configuration for all connections).

              There are three methods of altering your package properties each time you run the package (without the need to edit the package in BIDS manually):
              • The /SET, /CONFIG or the /CONN of the DTEXEC command prompt utility.
              • Property Expressions.
              • Package Configurations.

              Below the scenarios are presented in a diagram:


              Scenario 1 (environment variable points to the root folder for all the SSIS projects).
The first scenario is the one where the environment variable points to a root folder for all of the SSIS projects. In this scenario there are three steps to be taken:
1. Set up an environment variable (e.g. SSIS Root) which points to a root folder for all the SSIS projects.
2. Create one XML configuration file in a project folder, which points to a configuration database (where all other connections are stored).
3. Set all of the variables in the package with the information from the package configuration table (e.g. project root folder, folder structure, file names for checkpoints, etc).

              Advantages
• All the data and information is backed up along with SQL Server (except the XML configuration file).

              Disadvantages
There are some disadvantages to this approach:
• Since the strategy is to perform as little manual labor as possible when deploying the SSIS packages in a DTAP environment, a couple of extra steps are needed here. For instance, the XML configuration file path must be set at runtime (DTEXEC /SET) or extra environment variables must be used (one per connection). The root of this problem is that it isn’t possible to point one configuration file at another.
• Where to put the master XML configuration file? In every project or in one general folder?
• Security implications.
• Less maintainability.
               
              Scenario 2 (Using XML files for connection to a database).
In this scenario an XML configuration file is used for connecting to a database. The downside is that for every database an XML configuration file is needed. However, there are two available options: first, fixed paths, and second, setting the paths with /CONFIG. Since a fixed path is not very flexible, the second option would be the best. However, in my opinion, it is also not a desirable solution.

              Advantages
              • Simple copy deployment.

              Disadvantages
• What to do with (project) variables? An extra configuration file?
• Because the strategy is to do as little as possible when deploying packages in a DTAP environment, a couple of extra steps are undesirable. For instance, the XML configuration file path should be set during runtime (DTEXEC /SET), or fixed paths must be used in the SSIS package. Both solutions do not offer much flexibility. As has been mentioned earlier: the difficulty is to set up a path between two configuration files.

              Scenario 3 (environment variable for a connection to a database).
In the former scenario it was mentioned that a fixed path to an XML configuration file is not very flexible. In this scenario an environment variable is used for connecting to a database. This is done for every database, resulting in a lot of environment variables: an environment variable needs to be created for every database on the server.

              Advantages
              • Simplicity.

              Disadvantages
• Creating environment variables is mostly done by administrators in a production environment (and an acceptance environment). They won’t be happy when asked for a new environment variable.
• Inflexibility.
• Less security.
• Less maintainability.
• Reloading the package in BIDS every time an environment variable is created.

              Scenario 4 (environment variable points to the root folder for all the SSIS projects and store all the information in one XML file).
The next option is to create one XML configuration file for the project (solution) in which all the package configurations can be stored. An environment variable points to the root of the project. Therefore, every project needs an environment variable.

              Advantages
• Simple copy deployment.
• All information in one file.

              Disadvantages
              • Creating environment variables is mostly done by administrators on a production environment (and acceptance).
              • Inflexibility.
              • Less security.
              • Less maintainability.
              • Reloading the package in BIDS every time an environment variable is created.
• When one configuration file is created per project, multiple configurations for the same database connection can occur across multiple projects. So a lot of unnecessary configurations would be stored, which would result in a large amount of manual editing when transferring a database from one server to another.

              Scenario 5 (environment variable points to general configuration database) 
The idea for the following alternative came up during a discussion with a colleague. The idea is to use an environment variable for the connection to a SQL Server database (the configuration database) where all the context information of an environment is stored, such as the root folder, the folder structure of the project, connections to the databases, variables, etc. This is the place where all the other information is stored. Only two entries are needed in the package configuration: one for the environment variable, and one for storing all information of the connections (variables, etc) in the package configuration table.

In general there are three steps:
1. The environment variable, which holds the connection to the database with the package configuration table, is read by the package.
2. The package configuration table in that database is read.
3. All of the variables are set with the information from the package configuration table (e.g. root folder, folder structure, file names for checkpoints, etc).

              Please check the following displays:
              1. First create a project with two connections to the AdventureWorks2008R2 database: connection 1 and connection 2. 



2. Create an environment variable "Meta Services" and point it to the Meta Services database in which you create the package configurations table. Create a package configuration Meta Services and point it to the environment variable "Meta Services". It would be something like this:



               3. Create some variables for testing purposes:



              4.  Create a folder structure:

              5. Create an Execute SQL task:



              6. In this example I use an expression in the Checkpoint File Usage for testing whether the file path is set properly.

Okay, this is it. Now we can test the result. Add an extra variable varTest to the package. Make sure not to change anything in the package configuration definition. Insert the varTest configuration in the configuration table in the database and check whether SSIS shows the information from the configurations table in the package.

1. Create the test variable:



              2. Add a value in the SQL Server configurations table
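Assuming the standard layout of the SQL Server configurations table and a configuration filter named 'Meta Services', adding the value boils down to a single insert; the package path and value below are just an example:

-- Add the varTest configuration to the configurations table
-- (filter name, package path and value are just an example).
INSERT INTO dbo.[SSIS Configurations]
    (ConfigurationFilter, ConfiguredValue, PackagePath, ConfiguredValueType)
VALUES
    (N'Meta Services', N'Hello world',
     N'\Package.Variables[User::varTest].Properties[Value]', N'String');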



              3. And this works:


              So once you set up the configuration on a package level, you won’t need to go through the package configuration anymore. This indicates its flexibility.

But this solution has its own particular downsides! In the example above only one project was displayed, and most of the posts and articles I’ve seen so far address only one project. There are two available options in this case: either use the existing project together with the connections, or create a new project with a new configuration filter. Here are some considerations (http://toddchitt.wordpress.com/2008/06/27/ssis_config/):
1. Suppose you have 4 standard database connection managers that you use and you create your own SQL configuration ‘filter’ to save all 4 connection strings (resulting in 4 rows of data in the table, but all with the same filter value). Suppose the NEXT package you are creating only needs 3 of those 4 connections. When you connect to that filter and choose “Reuse Existing”, it is still going to hook the three connection managers’ ConnectionString property to the configuration, BUT when it tries to RUN the package, it will try to configure 4 connection managers and only find 3. Consequently an error will occur, though the package will still run. The following package validation error occurs: “Error 2, Error loading Package.dtsx: The connection "Connection" is not found. This error is thrown by Connections collection when the specific connection element is not found.”
2. Creating a new configuration filter for every project would result in multiple configuration filter groupings with multiple definitions of a connection to the same database. So when a database connection string changes, multiple updates are needed. However, this option could provide some flexibility when developing in multiple teams or when a database is being upgraded (an old and a new version exist).

The configuration table in the MetaServices database can be divided into different sections: master configurations, database connections and project configurations. In the master configurations section reside the master definitions, like a root folder for all of your SSIS projects. In the database connections section every connection is stored once, and in the project configurations section project variables can be stored.

              Advantages
              • Flexibility.
              • Strong cohesion in the SSIS package and less with the environment.
              • Maintainability.
              • High Security.

              Disadvantages
              • A configuration database is needed (master or perhaps per project).
              • Not easily transportable. One needs to create scripts to port from one environment to another. 
              • Problem of reusability of connections (error) or multiple configurations filters with the same definition of a connection (update problem).


              Scenario 6 (indirect configuration and one configuration per connection in a configuration database) 
This scenario consists of one package configuration with an indirect configuration (environment variable) pointing to a configuration database. For every connection a reference is created in the configuration table in the configuration database, and for every connection a reference is created in the package configuration. Ray Barley blogged about this on MSSQLTIPS.

So when you are creating a connection, you need to go through three steps: 1. create the connection, 2. create the package configuration, and 3. create a reference in the configurations table. These steps are displayed in the window below. The parent package variable is used for transferring audit information from the parent package.

               

              Advantages 
              • Flexibility.
              • Strong cohesion in the SSIS package and less with its environment.
              • Maintainability.
              • High Security. 

              Disadvantages
              • A configuration database is needed (master or per project).
              • Not easily transportable. One needs to create scripts to port from one environment to another. 
• One extra step is needed when creating a connection (one more than in scenario 5).
• Another disadvantage could come up when developing in teams with multiple versions of a database: teams desire different connection settings to different databases of the same application.
Conclusion
The 'problem' of using an XML configuration file is that you have to specify the path, or set the path dynamically with another package configuration (e.g. an environment variable). Scenario 5 has the downside of defining the same connections several times.

The best option so far is scenario 6. It is simple and straightforward!

              I take this opportunity to thank Arjan Fraaij for contributing to this post!

              Hennie de Nooijer