Performing Change Data Capture (CDC) without Timestamp Columns using SQL

In this blog post, we'll explore how to achieve Change Data Capture (CDC) using SQL in scenarios where both the source and target tables lack a timestamp column. CDC is a critical process that helps identify and track changes to data over time, enabling us to synchronize the target table with the latest data from the source.

Let's consider a dataset with three columns: id (the primary key), name, and location. Neither Source_table nor Target_table contains a timestamp column, so we start with a full data dump into Target_table as the initial load.

Step 1: Full Initial Load

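The first step is a full dump of the data from Source_table into Target_table, so that Target_table starts out with every row from the source. This assumes Target_table is empty; since id is the primary key, loading into a table that already holds some of these rows would fail with duplicate-key errors.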

-- Full initial load from Source_table to Target_table
INSERT INTO Target_table (id, name, location)
SELECT id, name, location
FROM Source_table;
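
The initial load can be tried out locally before it is run against a real database. The sketch below uses Python's built-in sqlite3 module with an in-memory database. The table definitions, the sample rows, and the initial_load helper are assumptions made for illustration, since the post gives no schema beyond the three column names.

```python
import sqlite3


def initial_load(conn):
    """Copy every row from Source_table into an (assumed empty) Target_table."""
    conn.execute(
        "INSERT INTO Target_table (id, name, location) "
        "SELECT id, name, location FROM Source_table"
    )
    conn.commit()


# Hypothetical schema and sample data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Source_table (id INTEGER PRIMARY KEY, name TEXT, location TEXT);
    CREATE TABLE Target_table (id INTEGER PRIMARY KEY, name TEXT, location TEXT);
    INSERT INTO Source_table VALUES
        (1, 'Asha', 'Pune'),
        (2, 'Ben', NULL),
        (3, 'Chen', 'Oslo');
""")

initial_load(conn)
loaded = conn.execute(
    "SELECT id, name, location FROM Target_table ORDER BY id"
).fetchall()
print(loaded)
```

After the load, Target_table should hold the same three rows as Source_table, including the NULL location.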

Step 2: Incremental Load and CDC

To capture incremental changes (inserted, updated, or deleted records) between Source_table and Target_table, we'll create a staging table to hold them. Staging_table has the same columns as the other two tables plus a flag column that records the type of change. The query below identifies the changes and stores them in Staging_table.

-- Perform Change Data Capture (CDC) and store the results in the Staging_table
INSERT INTO Staging_table (id, name, location, flag)
SELECT b.id, b.name, b.location, 'Updated' AS flag
FROM Target_table a
JOIN Source_table b ON (a.id = b.id)
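-- Compare each column separately and treat NULL as a value; concatenating the
-- columns would make ('ab', 'c') and ('a', 'bc') look identical
WHERE a.name <> b.name
   OR (a.name IS NULL AND b.name IS NOT NULL)
   OR (a.name IS NOT NULL AND b.name IS NULL)
   OR a.location <> b.location
   OR (a.location IS NULL AND b.location IS NOT NULL)
   OR (a.location IS NOT NULL AND b.location IS NULL)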
UNION ALL
SELECT a.id, a.name, a.location, 'Inserted' AS flag
FROM Source_table a
LEFT JOIN Target_table b ON (a.id = b.id)
WHERE b.id IS NULL
UNION ALL
SELECT a.id, a.name, a.location, 'Deleted' AS flag
FROM Target_table a
LEFT JOIN Source_table b ON (a.id = b.id)
WHERE b.id IS NULL;

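The query combines three SELECT statements with UNION ALL. The first finds rows whose id exists in both tables but whose name or location differs (Updated), the second finds rows present only in Source_table (Inserted), and the third finds rows present only in Target_table (Deleted). The flag column records which of the three cases each row belongs to.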
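
To see which flags the change-detection step produces, here is a minimal sketch using Python's built-in sqlite3 module. The schema, the Staging_table definition, and the sample rows are assumptions for illustration. Note one substitution: instead of concatenating the columns, this sketch compares them individually with SQLite's null-safe IS NOT operator, because concatenated values such as 'ab' + 'c' and 'a' + 'bc' cannot be told apart. Other databases spell the null-safe comparison differently.

```python
import sqlite3

# Change detection: one SELECT per change type, combined with UNION ALL.
# "IS NOT" is SQLite's null-safe inequality (NULL IS NOT 'x' is true).
DETECT_CHANGES = """
INSERT INTO Staging_table (id, name, location, flag)
SELECT b.id, b.name, b.location, 'Updated'
FROM Target_table a
JOIN Source_table b ON a.id = b.id
WHERE a.name IS NOT b.name OR a.location IS NOT b.location
UNION ALL
SELECT a.id, a.name, a.location, 'Inserted'
FROM Source_table a
LEFT JOIN Target_table b ON a.id = b.id
WHERE b.id IS NULL
UNION ALL
SELECT a.id, a.name, a.location, 'Deleted'
FROM Target_table a
LEFT JOIN Source_table b ON a.id = b.id
WHERE b.id IS NULL
"""

# Hypothetical schema and sample data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Source_table (id INTEGER PRIMARY KEY, name TEXT, location TEXT);
    CREATE TABLE Target_table (id INTEGER PRIMARY KEY, name TEXT, location TEXT);
    CREATE TABLE Staging_table (id INTEGER, name TEXT, location TEXT, flag TEXT);

    INSERT INTO Target_table VALUES
        (1, 'Asha', 'Pune'),   -- unchanged in source
        (2, 'Ben', NULL),      -- location changes from NULL to 'Lima'
        (3, 'Chen', 'Oslo'),   -- missing from source
        (4, 'ab', 'c');        -- changes to ('a', 'bc'): same concatenation
    INSERT INTO Source_table VALUES
        (1, 'Asha', 'Pune'),
        (2, 'Ben', 'Lima'),
        (4, 'a', 'bc'),
        (5, 'Dee', 'Rome');    -- new in source
""")

conn.execute(DETECT_CHANGES)
conn.commit()
staged = conn.execute(
    "SELECT id, name, location, flag FROM Staging_table ORDER BY id"
).fetchall()
for row in staged:
    print(row)
```

Row 1 is unchanged and does not appear in the staging table. Rows 2 and 4 are flagged Updated, row 3 Deleted, and row 5 Inserted.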

Step 3: Apply Changes to Target_table

Finally, we'll apply the changes stored in the Staging_table to update the Target_table. This step ensures that the Target_table reflects the latest data from the Source_table.

-- Apply changes to the Target_table
DELETE FROM Target_table
WHERE id IN (
  SELECT DISTINCT id
  FROM Staging_table
  WHERE flag IN ('Deleted', 'Updated')
);

INSERT INTO Target_table (id, name, location)
SELECT id, name, location
FROM Staging_table
WHERE flag IN ('Inserted', 'Updated');

The DELETE operation removes every Target_table row whose id is flagged "Deleted" or "Updated" in the Staging_table. The INSERT operation then adds the rows flagged "Inserted" or "Updated". An update is therefore handled as a delete followed by an insert: the stale version of the row is removed, and the current version from the source is written in its place.
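
Because an update is a delete followed by an insert, a failure between the two statements would leave updated rows missing from the target. The sketch below, again using Python's built-in sqlite3 module, runs both statements in one transaction. The schema, the sample rows, and the apply_changes helper are assumptions for illustration. Emptying the staging table at the end is also an assumption, since the post does not say how the staging table is reset between runs.

```python
import sqlite3


def apply_changes(conn):
    """Apply staged changes to Target_table atomically, then empty the staging table."""
    with conn:  # commits on success, rolls back if any statement raises
        conn.execute(
            "DELETE FROM Target_table WHERE id IN ("
            "SELECT id FROM Staging_table WHERE flag IN ('Deleted', 'Updated'))"
        )
        conn.execute(
            "INSERT INTO Target_table (id, name, location) "
            "SELECT id, name, location FROM Staging_table "
            "WHERE flag IN ('Inserted', 'Updated')"
        )
        # Assumption: start the next run with an empty staging table.
        conn.execute("DELETE FROM Staging_table")


# Hypothetical schema and sample data; the staging rows are written by hand
# here to stand in for the output of the change-detection query.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Source_table (id INTEGER PRIMARY KEY, name TEXT, location TEXT);
    CREATE TABLE Target_table (id INTEGER PRIMARY KEY, name TEXT, location TEXT);
    CREATE TABLE Staging_table (id INTEGER, name TEXT, location TEXT, flag TEXT);

    INSERT INTO Source_table VALUES (1, 'Asha', 'Pune'), (2, 'Ben', 'Lima'), (5, 'Dee', 'Rome');
    INSERT INTO Target_table VALUES (1, 'Asha', 'Pune'), (2, 'Ben', NULL), (3, 'Chen', 'Oslo');
    INSERT INTO Staging_table VALUES
        (2, 'Ben', 'Lima', 'Updated'),
        (3, 'Chen', 'Oslo', 'Deleted'),
        (5, 'Dee', 'Rome', 'Inserted');
""")

apply_changes(conn)

query = "SELECT id, name, location FROM {} ORDER BY id"
target_rows = conn.execute(query.format("Target_table")).fetchall()
source_rows = conn.execute(query.format("Source_table")).fetchall()
staging_left = conn.execute("SELECT COUNT(*) FROM Staging_table").fetchone()[0]
print(target_rows == source_rows, staging_left)
```

Comparing the two tables after the apply step, as the last lines do, is a cheap sanity check that the target really matches the source.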

Summary

In this blog post, we demonstrated how to perform Change Data Capture (CDC) when neither the source nor the target table has a timestamp column. Using plain SQL, we compare the two tables to capture incremental changes, store them in a staging table, and then apply them to the target table. Because every run compares the full contents of both tables, the cost grows with table size, but the approach keeps the target synchronized with the source without adding timestamp columns to either table.
