SQL Change Data Capture - CDC

Change data capture utilizes the SQL Server Agent to log insertions, updates, and deletions occurring in a table. So, it makes these data changes accessible to be easily consumed using a relational format. - learn.microsoft.com

This is useful when data from different databases is sucked up into a data lake, this way we can easily keep the data lake current with changes we care about.

Simple POC

You can use sample databases from Microsoft like AdventureWorks or Northwind or you can just invent your own. For simplicity and reasons of being a nerd I did the latter.

A. Create a container for the SQL Server with settings ready for CDC

Local Docker SQL Container
  1. Spin up a Docker SQL Container. This will start the container with Product ID (PID) as Developer Edition.
1
docker run --name cdc-container -e "ACCEPT_EULA=Y" -e "MSSQL_SA_PASSWORD=Password!" -e MSSQL_AGENT_ENABLED=true -p 5901:1433 -d mcr.microsoft.com/mssql/server:2019-latest
  • --name cdc-container names the container, else you will get a random name like evil_ptolemy
  • -e "ACCEPT_EULA=Y" -e "MSSQL_SA_PASSWORD=Password!" -e MSSQL_AGENT_ENABLED=true are environmental variables, MSSQL_AGENT_ENABLED is important as this agent runs the CDC jobs
  • -p 5901:1433 is the port, so 5001 is your machine which maps to 1433 in the container
  • -d means run detached from the console
  1. Connect to the instance from SQL Server Management Studio
1
2
3
Server Name: localhost,5901
Login: sa
Password: Password!
  1. Create the sample database and table, I adapted these from w3schools. The table needs a primary key for CDC to track on.
1
2
3
4
5
6
7
8
9
10
CREATE DATABASE databasename;

USE databasename;
CREATE TABLE person (
id int IDENTITY(1,1) PRIMARY KEY,
last_name varchar(255),
first_name varchar(255),
address varchar(255),
city varchar(255)
);

You should now see the database in the object explorer:

databasename in object explorer

B. Confirgure the SQL database to track CDC changes

Setup CDC
  1. The ownership of the table should be sa, if you restored a sample database you will need to run ALTER AUTHORIZATION before verifying the owner with the SQL below
1
2
3
4
SELECT d.name, d.owner_sid, sl.name
FROM sys.databases AS d
JOIN sys.sql_logins AS sl
ON d.owner_sid = sl.sid;

databasename owner

  1. Enable tracking changes, you can also disable with the commented out procedure below it.
1
2
EXEC sys.sp_cdc_enable_db
--EXEC sys.sp_cdc_disable_db
  1. Verify table changes now exists, it will be blank but this means the above was successful. The cdc schema tables are surfaced under System Tables
1
SELECT * FROM cdc.change_tables

table changes

  1. Enable changes for the table person.
1
2
3
4
5
6
7
8
9
10
EXEC sys.sp_cdc_enable_table
@source_schema = N'dbo'
, @source_name = N'person'
, @role_name = NULL
-- , @capture_instance = N'foo_person_changes'
, @capture_instance = NULL
, @supports_net_changes = 1
, @captured_column_list = N'id, last_name, first_name'
, @filegroup_name = N'PRIMARY';
GO
  • capture_instance is the new table name. NULL will set the default based on the params example cdc.dbo_person_CT. You could set this to be foo_person_changes_CT (the _CT is always suffixed by default)
  • supports_net_changes is for when the column changes frequently, so only capture the net change (ie the last one)
  • captured_column_list these are column you want to track, you have to include the primary key

This will take a few seconds to complete as its creating jobs. The capture job captures the data as it changes from the log file. The cleanup job purges the captured data, it will only keep it for 3 days by default. You can change this with @retention= which needs to be in minutes.

cdc jobs

If you want to disable the table you can use this SQL

1
2
3
4
EXEC sys.sp_cdc_disable_table
@source_schema = N'dbo'
, @source_name = N'person'
, @capture_instance = N'dbo_person'
  1. Checking the table changes should now surface that person is being tracked
1
SELECT * FROM cdc.change_tables

person is being tracked

  1. Now query the change table, it should be empty
1
SELECT * FROM cdc.dbo_person_CT

empty change table

  1. Knowing what each change table column is used for helps understand the data when quering it
  • _$start_lsn commit log sequence number (LSN) within the same Transaction
  • _$seqval order changes within a transaction
  • _$operation 1=delete, 2=insert,3=updatebefore, 4=updateafter
  • _$update_mask for insert and delete all bits are set, for update bits set correspond to columns changed

C. Test and view the CDC changes

Test CDC

Now we can finally test the tracked changed

  1. INSERT some data, this is _$operation 2
1
2
3
4
INSERT INTO databasename.dbo.person 
(last_name,first_name,address,city)
VALUES
('paton', 'carl', '10 road name', 'durban');
  1. UPDATE the data, this is _$operation 3 and 4
1
2
3
UPDATE databasename.dbo.person
SET last_name = 'foo', first_name = 'bar'
WHERE id = 1
  1. DELETE the data, this is _$operation 1
1
2
DELETE FROM databasename.dbo.person 
WHERE id = 1
  1. Query the changes
1
2
3
4
5
6
7
8
-- INSERTS
SELECT * FROM cdc.dbo_person_CT WHERE __$operation = 2

-- UPDATE
SELECT * FROM cdc.dbo_person_CT WHERE __$operation IN (3,4)

-- DELETE
SELECT * FROM cdc.dbo_person_CT WHERE __$operation = 1

Here you can see for __$operation = 2 new data was created

insert changes

For __$operation IN (3,4) we have the update before (3) and update after (4) and for both __$seqval is the same meaning it was in the same transaction

update changes

For __$operation = 1 new data was deleted

delete changes

References