One subject that I'm interested in is Data Quality Services together with the features like Master Data Services. In this blogpost i'll try to discover the basics of DQS. I investigated this feature in CTP1 already but it wasn't ready yet, unfortunately. In CTP 3 (and RC0) the DQS seems more and more a finished product and i think it's time to discover some of the features in this post.
DQS is all about the quality of data in databases. The reason why this is a interesting feature, is that it can be used in a ETL for a datawarehouse. If a source system has a lack of good dataquality, DQS can improve the quality of data during the load into the datawarehouse.
On MSDN i learn that DQS contains three technical components:
- DQS Server. Two databases :
- DQS_MAIN. DQS_MAIN db contains all the DQS stored procedures, its engine and the published knowledge bases.
- DQS_PROJECTS. The DQS_PROJECTS db contains all the data quality projects information.
- DQS client. The DQS client is a standalone application that enables you to connect to a DQS server, providing you with a GUI to perform your data-cleansing tasks, as well as other administrative tasks related to DQS.
- DQS Cleansing SSIS. The DQS Cleansing SSIS component enables you to integrate the DQS functionality into an SSIS data flow, for batch execution.
1. Install DQS Server
The first thing we have to execute is installing the Data Quality Server. The installer seems to be available in the start menu. A more obvious place would be the installer of SQL Server 2012 but perhaps this is a temporary solution.
A master key is needed for installing.
And a succes is reported:
Let's start the SQL Server Management Studio and see whether the databases have been installed:
There seems to be another database present : DQS_STAGING_DATA.
2. Grant DQS roles
According to the video i need to grant a user administrator rights.
3. Turn on TCP/IP with configuration tool
No window available.
For installing this on a VM step 2 and 3 are not necessary.
Using DQS client
Now its time to use the DQS client. As said earlier, the DQS client is a stand alone application that can be used by power users to create a knowledge base and business rules to check the dataquality.
1. DQS Client
First thing we have to do is starting the DQS client.
Click on the Connect button and the following window is presented:
There are three areas that are interesting:
- Knowledge base Management. This is a important part of the DQS infrastructure. In this area we define the rules when the data is validated.
- Data Quality projects. Create a new DQS project.
- Administration. This covers Activity Monitoring and Configuration and is used for monitoring the usage of the knowledge base and connecting to 3rd party reference databases.
2. Open a Knowledge base
Let's take a look whether this a pre installed Knowledge base.
Here you can see an example of a table with domain values that should be converted to the right country.
Okay let's try to create our own new knowledge base and for this test i'll be creating a domain list of two dutch places : Amsterdam and Utrecht.
And the following window appears:
Once the above step is completed we will proceed with the Domain Management Section. We need to create a new domain.
Enter a Domainname. The window below is more extended than the CTP3.
Press Ok and the following window appears.
Create a new domain
Enter the values. There is a distinction between the valid value and the incorrect values that needs to be converted to correct value. For instance 'Ams' is converted to 'Amsterdam'.
You need to publish this to the DQS Server.
Succeeded as the window shows:
The new knowledge base is present in Recent Knowledge base window
When this is finished it's possible to create a DQS project. This seems an extra possibility for the poweruser to correct some values for themselves. As stated earlier, there is also a DQS SSIS component that can be used for quality checks in the ETL proces.
Define the DQS project in more detail like where is the source file and where do you want to store the results:
This is the source table:
This is the window when all information is entered.
After pressing the start button the following window appears.
Categorizing the values.
And here are the results:
And the table in SQL Server shows the following information:
DQS seems a great tool in the Self Service BI approach. The poweruser can maintain the knowledge base with little help of IT and the ETL developer can use this DQS functionality in the ETL process. This will aid the dataquality of the data. So the ETL developer will be more responsible for building the ETL package and the poweruser is content responsible.
In one of the following blogposts i'll investigate the SSIS DQS component.