woensdag 15 september 2010

Change Data Capture (CDC)

Change data capture is a way of capturing changed data from a source system within a specified window of time. These changes includes inserts, updates and deletes and the required window of time may vary from minutes, hours to weeks. It’s a new feature since SQL Server 2008 and it’s based on replication and designed with the SSIS team.

CDC is available on certain editions of SQL Server, namely SQL Server 2008 Enterprise, Developer and Evaluation editions. As an ETL developer you could be tempted to turn CDC on for every table, but that is a bad idea. Although CDC has low overhead, it does not have zero overhead.

Below you can see the steps for turning CDC on.

Enable CDC on at database level. This will let SQL Server set up the required infrastructure that we’ll need later. It creates a schema called cdc as well as security, functions and procedures.

USE AdventureWorks2008R2

--enable CDC on the database
EXEC sys.sp_cdc_enable_db

--Check CDC is enabled on the database
SELECT name, is_cdc_enabled
FROM sys.databases
WHERE database_id = DB_ID()

Make sure that the SQL Server agent is running on the same server where CDC is turned on. Start up SQL Server 2008 R2 configuration manager and check whether the agent is running.
Now it’s possible to enable CDC functionality on a table. For instance:
USE AdventureWorks2008R2


--enable CDC on a specific table
EXECUTE sys.sp_cdc_enable_table
@source_schema = N'HumanResources',
@source_name = N'Employee',
@role_name = N'cdc_Admin',
@capture_instance = N'HumanResources_Employee',
@supports_net_Changes = 1;

And this results in :

Job 'cdc.AdventureWorks2008R2_capture' started successfully.
Job 'cdc.AdventureWorks2008R2_cleanup' started successfully.

You could check whether CDC is enabled for any particular table with the following query:

--Check CDC is enabled on the table

SELECT [name],
FROM sys.tables
WHERE [OBJECT_ID] = OBJECT_ID(N'HumanResources.Employee')

--Or use the CDC help procedure
EXECUTE sys.sp_cdc_help_change_data_capture
@source_schema = N'HumanResources',
@source_name = N'Employee'

So now CDC is turned on for the employee table. When this enabled SQL Server uses a default naming convention to create a shadowtable in the same database, called cdc.HumanResources_employee_CT. This table has the same schema as the source table, but is has extra metadata columns that CDC needs to do its magic.

SELECT * FROM cdc.HumanResources_Employee_CT

__$start_lsn = identifier of the original transaction
__$segval = Order of transactions
__$update_mask =
__$operation = type of operation (1 = delete, 2 = insert, 3 = update (before image), 4 (after image), 5 = merge)

Okay let’s make an update to this table and see what will happen. Hey, for every update there are two records: one before and one after image.

Ok now lets play with some of the API stored procedures and functions. With the following code you can retrieve the actual values of latest present records.

USE AdventureWorks2008R2


DECLARE @dBeginTime AS Datetime = GETDATE() -1
DECLARE @dEndTime AS Datetime = GETDATE()
DECLARE @bFromLsn AS Binary(10) = sys.fn_cdc_map_time_to_lsn('smallest greater than or equal', @dBeginTime)
DECLARE @bToLsn AS Binary(10) = sys.fn_cdc_map_time_to_lsn('largest less than or equal', @dEndTime)

DECLARE @bMinLsn AS Binary(10)= sys.fn_cdc_get_Min_lsn('HumanResources_Employee')

IF @bFromLsn < @bMinLsn SET @bFromLsn = @bMinLsn

SELECT * FROM cdc.fn_cdc_get_net_changes_HumanResources_Employee(@bFromLsn, @bToLsn, N'all with mask');

Ok let’s use this information with SSIS. We could build a robust and efficient ETL solution.
Thats it for now!!

Geen opmerkingen:

Een reactie posten