The Versatility of SQL MERGE Statement in Data Warehousing

Huzefa Khan
3 min readDec 31, 2023

--

SQL, the cornerstone of data management, incorporates a range of robust commands designed for efficient data manipulation. Among these commands, the MERGE statement is notable for its capacity to execute multiple operations — INSERT, UPDATE, and DELETE — within a single, cohesive command. In the landscape of data warehousing, where frequent data synchronization and manipulation tasks occur, the MERGE statement emerges as a pivotal tool.

PostgreSQL 15 introduces a fresh SQL Command named MERGE. Although MERGE has existed within the SQL standard for some time, it has recently been incorporated into the PostgreSQL codebase. Before the arrival of MERGE, if there was a need to update a target table from a source table in PostgreSQL versions prior to 15, the “upsert” method with the ON CONFLICT clause was the typical approach.

Now, MERGE can be used instead! Some situations where MERGE makes a lot of sense are:

Data loading from external sources, thru foreign data wrappers

Staged and batched process jobs

Understanding the MERGE Statement

The MERGE statement functions as a Swiss Army knife for data warehouses, allowing simultaneous handling of data from source to target tables. Its structure typically involves a source dataset and a target dataset, using conditions to determine actions.

MERGE INTO target_table AS target
USING source_table AS source
ON target.matching_column = source.matching_column
WHEN MATCHED THEN
-- Update operation
UPDATE SET target_column = source_column
WHEN NOT MATCHED THEN
-- Insert operation
INSERT (column1, column2, ...)
VALUES (value1, value2, ...)
WHEN NOT MATCHED BY SOURCE THEN
-- Delete operation
DELETE;

Step 1: Create a Temporary Dataset

WITH fetch_user AS (
SELECT
user_id,
username,
name ,
first_name,
last_name,
email,
address,
MD5(user_id || username || email || address) AS upd_hash_code
FROM raw.users
)

Step 2: Apply the MERGE Statement

MERGE INTO process.users AS tgt  
USING fetch_user AS src
ON tgt.ext = src.user_id
WHEN NOT MATCHED THEN

INSERT (ext, name, first_name, last_name, tag_id, username, email, address, updated_date, upd_hash_code)
VALUES (1, 1, CAST(src.user_id AS INT), src.name, src.first_name, src.last_name, src.utag_id, src.username, src.email, src.address, CURRENT_TIMESTAMP, src.upd_hash_code)

WHEN MATCHED AND src.upd_hash_code <> tgt.upd_hash_code THEN
UPDATE SET
name = src.name,
first_name = src.first_name,
last_name = src.last_name,
tag_id = src.utag_id,
username = src.username,
email = src.email,
address = src.address,
updated_date = CURRENT_TIMESTAMP,
upd_hash_code = src.upd_hash_code;

Including MD5 Hashing for Data Integrity

In the process of synchronizing user data using the SQL MERGE statement in data warehousing, an additional layer of data integrity is introduced through MD5 hashing. The MD5 hashing algorithm plays a crucial role in this scenario by creating a unique hash based on concatenated values from pertinent columns of the source dataset. This resultant hash, upd_hash_code, serves as a checksum—a fingerprint, so to speak—of the combined values of user_id, username, email, and address columns from the raw.users table. The generated hash provides a streamlined means to compare data between the source and target tables. Within the MERGE operation, comparing upd_hash_code values helps swiftly identify any changes within the specified columns. This approach not only streamlines the comparison process but also ensures data consistency and aids in detecting alterations within the source data, contributing significantly to the robustness and reliability of the synchronization process.

Conclusion

The SQL MERGE statement provides a robust mechanism for managing data in data warehousing scenarios. Its ability to handle different operations based on specified conditions makes it a versatile choice for synchronizing and manipulating datasets efficiently.

In the context of synchronizing user data, as showcased in the example, the MERGE statement simplifies the process of aligning information between source and target tables while ensuring data consistency.

Harnessing the power of the MERGE statement streamlines data maintenance in data warehousing environments, facilitating seamless operations and maintaining data integrity.

--

--

Huzefa Khan
Huzefa Khan

Written by Huzefa Khan

Passionate Sr. Data Engineer with years of experience in developing and architecting high-class data solutions https://www.linkedin.com/in/huzzefakhan/

No responses yet