AWS Redshift Upsert (Merge)

Huzefa Khan
3 min read · Jan 31, 2023


Loading data into a data warehouse often involves an “UPSERT”: a combination of update and insert operations. It is needed when the incoming data contains both new and existing records, with the new rows needing to be inserted and the existing rows needing to be updated. Typically this happens when moving data from a staging area into the actual data warehouse. Those familiar with SQL Server will know the “MERGE” statement, which handles this (and more, such as deletion) in a single command. Personally, I prefer not to use “MERGE” and instead write a procedure that implements the “UPSERT”.
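For contrast, here is roughly what that single-statement upsert looks like in SQL Server; the table and column names (salesid, amount) are illustrative placeholders, not from a real schema:

-- SQL Server MERGE: update matching rows and insert new ones in one statement
MERGE dwh.fact_sales AS target
USING stg.fact_sales AS source
    ON target.salesid = source.salesid
WHEN MATCHED THEN
    UPDATE SET target.amount = source.amount
WHEN NOT MATCHED THEN
    INSERT (salesid, amount) VALUES (source.salesid, source.amount);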

Redshift does not offer a single “MERGE” statement, so in order to execute an “UPSERT” we follow the procedure described in the AWS documentation under “Updating and inserting new data”.


The process boils down to the following steps:

  1. Create a temporary staging table.
  2. Load the new data into the staging table.
  3. Delete the rows that exist in both the data warehouse table and the staging table (these are the rows that would otherwise be updated).
  4. Insert all rows from the staging table into the data warehouse table.

The staging table is a temporary table that holds all of the data that will be used to make changes to the target table, including both updates and inserts.
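As a minimal sketch of steps 1 and 2, assuming the new data arrives as flat files in S3 (the bucket path and IAM role below are placeholders):

-- Step 1: create a temporary staging table with the same structure as the target
create temp table stage_fact_sales (like dwh.fact_sales);

-- Step 2: load the new data into the staging table; COPY from S3 is the typical route
copy stage_fact_sales
from 's3://my-bucket/sales/latest/'
iam_role 'arn:aws:iam::123456789012:role/MyRedshiftRole'
format as csv;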

For example, suppose we have a fact_sales table in both the “stg” and “dwh” schemas. Our load process writes the last few days of new data into the stg table, and we now want to merge it into the dwh table.

-- Delete rows from the dwh table if they exist in the stg table
delete from dwh.fact_sales
using stg.fact_sales
where dwh.fact_sales.salesid = stg.fact_sales.salesid;

Once those rows are deleted from the dwh table, we can use an insert statement:

-- Insert all the rows from the stg table into the dwh table
insert into dwh.fact_sales
select * from stg.fact_sales;
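In practice, the AWS documentation recommends running the delete and insert as a single transaction, so that other queries never observe the table between the two steps. A sketch:

begin transaction;

-- Delete rows from the dwh table if they exist in the stg table
delete from dwh.fact_sales
using stg.fact_sales
where dwh.fact_sales.salesid = stg.fact_sales.salesid;

-- Insert all the rows from the stg table into the dwh table
insert into dwh.fact_sales
select * from stg.fact_sales;

end transaction;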

Using “UPDATE” followed by “INSERT” requires two scans, whereas using “DELETE” followed by an “INSERT” of all staging rows needs only one, which is more efficient. There is, however, an even faster method, and it is the focus of this article.

By keeping the source (staging) and target (data warehouse) tables with exactly the same structure, it’s possible to use the “ALTER TABLE APPEND” statement, which is faster than “INSERT”. This statement moves data from the source table to the target table.

-- Insert all the rows from the stg table into the dwh table using alter
ALTER TABLE dwh.fact_sales APPEND FROM stg.fact_sales;

This statement is faster than an insert and can save us some time. When utilizing “ALTER TABLE APPEND”, it’s crucial to understand the following:

  1. The tables must have identical structures; if one contains additional columns, extra keywords must be added to the statement (see the sketch after this list).
  2. The statement moves data blocks from one table to the other rather than copying rows, which is what makes it fast; however, storage usage temporarily increases because the moved storage is not compacted.
  3. Importantly, “ALTER TABLE APPEND” cannot be executed within a transaction block, meaning there is no option to roll it back together with, for example, a preceding delete operation, which is a trade-off developers should consider.
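For the mismatched-columns case in point 1, Redshift provides the IGNOREEXTRA and FILLTARGET options; a sketch, reusing the fact_sales tables from above (only one option applies for a given pair of tables):

-- Source has extra columns the target lacks: ignore them
ALTER TABLE dwh.fact_sales APPEND FROM stg.fact_sales IGNOREEXTRA;

-- Target has extra columns the source lacks: fill them with defaults or NULL
ALTER TABLE dwh.fact_sales APPEND FROM stg.fact_sales FILLTARGET;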

Reference: https://docs.aws.amazon.com/redshift/latest/dg/t_updating-inserting-using-staging-tables-.html


Huzefa Khan

Passionate Sr. Data Engineer with years of experience in developing and architecting high-class data solutions https://www.linkedin.com/in/huzzefakhan/