In a former post I already introduced the DQS Quality Client and the installation of DQS of SQL Server 2012 RC0 Denali. The main advantage of DQS lies in the usage of the SSIS DQS component "DQS Cleansing" in SQL Server Integration Services. You can implement the DQS component in an ETL process in order to cleanse the information from source systems.
James Beresford investigated already the performance of the DQS component and it seems that usage of the component cost quite some performance and therefore you shouldn't use it on a large tables (yet?).
You can find more info, here:
DQS Cleansing component
In this blogpost i'll discover the DQS SSIS component. I'll show you the steps that i took in order to build a SSIS package with the DQS Cleansing component. The first thing i did was creating a package and a drop the DQS Cleansing component on the pane.
An input stream and an output stream is needed.
With the source- and the destination assistent you can create a source and destination adapter. When i double click on the server for connecting to the Knowledge base an error occurs.
I tried some some potential solutions like reopening the component and reopening the package. This didn't help. So i decided to reboot SQL Server instance and the error disappeared.
Something is wrong here. After some adjustments and clicking the test connection button i ran in an error, again. The error below is shown when the test connection button in the connection manager is pressed.
Again, I restarted the SQL Server service and now it seems to be working again. Next, i'm setting up the mapping.
Below the tab with some checkboxes.
In the destination adaptersome fields are not listed in the SSIS pipeline.
I remembered that on the advanced tab there were some checkboxes: Confidence and Reason. I enabled this checkboxes.
Now these fields are selectable in the Destination adapater.
Below you can see the new progress icons. Nice!
Finished and succeeded
And below you can see the result of this little exercise.
This seems an useful component of SSIS but when i read the blogs of James Beresford i'm getting less enthusiastic. The main reason is the perfomance of the DQS component. He calculates that for instance 1 million rows will take about 4 hours of loading and that is not acceptable in my opinion. In an other blogpost he states that performance can be optimized by parallelism but it's still not very usable for large tables (dimensions).
In my current project i'm working with a 2 million customer table (dimension) and this seems too large for this DQS SSIS component. So James advices to use the DQS for midsized tables. In my assumption this is like around 100 K - 500K rows.
There is some work to do for the development team:
- The error connecting to the Server.
- Hopefully they can improve the performance to higher level