Unleashing the Power of AWS Redshift COPY JOB: Streamlined Data Ingestion with Code Examples

Huzefa Khan
3 min readNov 12, 2023

--

In the fast-paced world of data warehousing, the efficiency of data ingestion is crucial. AWS Redshift, a leading data warehousing solution, has introduced a game-changing feature in its arsenal: the Redshift COPY JOB. This new functionality, currently in preview, automates the loading of data from Amazon S3 to Redshift tables, revolutionizing how data teams manage ETL processes. Let’s dive into the nuts and bolts of this feature with some practical code examples.

What is Redshift COPY JOB?

The Redshift COPY JOB is an extension of the well-known COPY command, which is essential for bulk loading data into Redshift from various sources like Amazon S3. What sets COPY JOB apart is its capability to automate this process, detecting new files in an S3 bucket and loading them into Redshift without manual intervention.

Key Advantages:

  1. Automation: Automatically detects and loads new files from an S3 path.
  2. Efficiency: Utilizes Redshift’s MPP architecture for fast, parallel data loading.
  3. Simplicity: Reduces the need for external scheduling tools or scripts.

Prerequisites

Before using COPY JOB, ensure you have:

  1. An AWS Redshift Cluster.
  2. Data in an Amazon S3 bucket.
  3. An IAM Role with necessary permissions for Redshift to access S3.

Syntax and Code Examples

Basic COPY JOB Creation

COPY public.sales
FROM 's3://mydata-bucket/sales-data/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
JOB CREATE sales_data_job
AUTO ON;
  • This command creates a COPY JOB named sales_data_job.
  • It automatically loads data from the sales-data S3 bucket into the public.sales table in Redshift.
  • AUTO ON ensures continuous monitoring and loading of new files.

Altering a COPY JOB

To change a COPY JOB’s settings:

COPY JOB ALTER sales_data_job AUTO OFF;
  • This alters the sales_data_job to stop automatic loading of new data.

Running a COPY JOB Manually

COPY JOB RUN sales_data_job;
  • Manually runs the sales_data_job, useful if AUTO is set to OFF.

Viewing COPY JOB Details

To list and show details of COPY JOBS:

COPY JOB LIST;
COPY JOB SHOW sales_data_job;
  • Lists all COPY JOBS and shows details of sales_data_job.

Deleting a COPY JOB

COPY JOB DROP sales_data_job;
  • Removes the sales_data_job from Redshift.

Best Practices and Considerations

  • Initial Setup: Ensure the S3 bucket path specified in the COPY JOB is initially empty.
  • Data Formats: COPY JOB doesn’t support ORC and Parquet formats.
  • IAM Role: Use specific IAM roles; default cluster roles are not supported.
  • Error Handling: Since MAXERROR is not supported, ensure data quality in S3.
  • Monitoring: Regularly check system tables like STL_LOAD_ERRORS for any load issues.

Conclusion

The introduction of COPY JOB in AWS Redshift marks a significant leap in data loading efficiency. It empowers teams to automate their ETL processes, ensuring that their data warehouses are continually updated with the latest data with minimal manual oversight. As this feature is still in preview, it’s advisable to experiment and become familiar with its capabilities in a test environment before deploying it in production. With the insights and examples provided, you’re now equipped to leverage this powerful feature in your data management strategy.

Reference:
https://docs.aws.amazon.com/redshift/latest/dg/r_COPY-JOB.html

--

--

Huzefa Khan
Huzefa Khan

Written by Huzefa Khan

Passionate Sr. Data Engineer with years of experience in developing and architecting high-class data solutions https://www.linkedin.com/in/huzzefakhan/

No responses yet