Developing and testing AWS Glue job scripts locally on Windows

Huzefa Khan
4 min read · Jan 19, 2025

To build a production-ready data platform, setting up a robust development process and CI/CD pipeline for AWS Glue jobs is crucial. AWS Glue provides Docker images on Docker Hub, enabling flexible development and testing of Glue jobs in a Docker container. Setting up AWS Glue locally can be highly beneficial for a variety of reasons, especially for development, testing, and troubleshooting purposes. Here are some key motivations for setting up AWS Glue locally:

Cost Efficiency

Developing and testing Glue scripts locally avoids the need to continuously use AWS Glue resources, saving costs associated with AWS Glue jobs, Data Catalog, and S3 storage during development.

Faster Development Cycle

Running and debugging Glue scripts locally eliminates the delays associated with uploading scripts to AWS, starting Glue jobs, and waiting for logs to appear in CloudWatch. This allows for quicker iterations and troubleshooting.

Ease of Debugging

Local setups allow the use of integrated debugging tools (e.g., PyCharm or VS Code) to step through PySpark or Glue scripts line by line, which is not possible directly in the Glue environment.

Custom Testing Scenarios

Local setups make it easier to create and test custom datasets, use mock data, and simulate scenarios without uploading files to S3 or altering live Glue Data Catalogs.

Offline Development

Once the Docker image has been pulled, a local Glue setup lets you develop against local data without an active internet connection, which is especially helpful in secure or offline environments.

Environment Customization

Local environments allow you to use custom configurations, dependencies, and versions of libraries that might not be readily supported in AWS Glue’s managed environment.

Pre-production Validation

Running scripts locally ensures that they are tested thoroughly before being deployed to the production Glue environment, reducing the risk of failures in live jobs.

Team Collaboration

Developers can share and test Glue scripts in local environments without needing access to AWS accounts, facilitating collaboration and reducing dependency on cloud resources during development.

Steps to set up Glue locally

If you are using Windows, WSL is the recommended approach.

Install the following prerequisites

1. Install WSL on Windows (https://learn.microsoft.com/en-us/windows/wsl/install)

2. Install Docker and make sure the Docker daemon is running inside WSL, not on Windows (https://docs.docker.com/engine/install/).

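# Build tools and libraries
sudo apt install libz-dev libssl-dev libcurl4-gnutls-dev libexpat1-dev gettext cmake gcc
# Docker Engine packages (Docker's apt repository must already be configured, as described at the link above)
sudo apt-get install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin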

3. Install the AWS CLI

curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
sudo apt install unzip -y
unzip awscliv2.zip
sudo ./aws/install
aws --version

4. Configure your AWS credentials with the AWS CLI, providing the AWS Access Key ID and Secret Access Key when prompted

aws configure
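Because the container in the later steps mounts ~/.aws and reads the profile named by AWS_PROFILE, it can help to confirm that the profile was actually written. The following is a minimal sketch using only the Python standard library. The function name has_profile is hypothetical, and the sketch assumes the default credentials file location, ~/.aws/credentials.

```python
import configparser
from pathlib import Path


def has_profile(credentials_path, profile="default"):
    """Return True if the credentials file defines the profile with both keys."""
    parser = configparser.ConfigParser()
    # read() silently skips files that do not exist, leaving the parser empty
    parser.read(credentials_path)
    if not parser.has_section(profile):
        return False
    section = parser[profile]
    return "aws_access_key_id" in section and "aws_secret_access_key" in section


if __name__ == "__main__":
    # Assumed default location written by `aws configure`
    path = Path.home() / ".aws" / "credentials"
    print(f"profile 'default' configured: {has_profile(path)}")
```

If this reports False, rerunning aws configure is the simplest fix.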

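5. Set WORKSPACE_LOCATION to the local directory that holds your Glue scripts (it is mounted into the container in the next step)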

WORKSPACE_LOCATION=/home/dev

6. Run the following command to start the Glue container with an interactive PySpark shell. Keep this terminal open, because the container stops when the shell exits.

 docker run -it -v ~/.aws:/home/glue_user/.aws -v $WORKSPACE_LOCATION:/home/glue_user/workspace/ -e AWS_PROFILE=default -e DISABLE_SSL=true --rm -p 4040:4040 -p 18080:18080 --name glue_pyspark amazon/aws-glue-libs:glue_libs_4.0.0_image_01 pyspark

[Screenshot: output of the above command]
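The flags in the docker run command from step 6 are easier to follow when laid out one per line. The sketch below assembles the same argument list in Python and prints it. The helper name glue_docker_command and its parameters are hypothetical, and the port comments reflect Spark's usual defaults (4040 for the Spark UI, 18080 for the history server).

```python
import os
import shlex

# Image tag taken from the docker run command in step 6
IMAGE = "amazon/aws-glue-libs:glue_libs_4.0.0_image_01"


def glue_docker_command(workspace, profile="default", name="glue_pyspark",
                        entrypoint=("pyspark",)):
    """Assemble the docker run argument list for the Glue container."""
    aws_dir = os.path.expanduser("~/.aws")
    return [
        "docker", "run", "-it", "--rm",
        "-v", f"{aws_dir}:/home/glue_user/.aws",          # share AWS credentials
        "-v", f"{workspace}:/home/glue_user/workspace/",  # share local scripts
        "-e", f"AWS_PROFILE={profile}",
        "-e", "DISABLE_SSL=true",
        "-p", "4040:4040",    # Spark UI
        "-p", "18080:18080",  # Spark history server
        "--name", name,
        IMAGE,
        *entrypoint,
    ]


if __name__ == "__main__":
    workspace = os.environ.get("WORKSPACE_LOCATION", "/home/dev")
    # Print a copy-pasteable shell command instead of running it
    print(shlex.join(glue_docker_command(workspace)))
```

Passing entrypoint=("spark-submit", "/home/glue_user/workspace/cash_adjustment.py", "--JOB_NAME", "test") instead would, assuming the script sits at the top of the workspace directory, submit the script rather than open the interactive shell.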

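7. Open the working directory in VS Code and run the job

In VS Code, click “Open a Remote Window” (bottom-left corner), choose “Attach to Running Container”, select the glue_pyspark container started in step 6, and allow access when prompted. This requires the Dev Containers extension.

Run the job from the VS Code terminal inside the container. Make sure the container is still running in the WSL terminal.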

[glue_user@ae37b3b46013 dev]$ python3 cash_adjustment.py --JOB_NAME test
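The contents of cash_adjustment.py are not shown in this article, so the following is only a minimal sketch of what a locally runnable Glue job script could look like. getResolvedOptions, GlueContext and Job come from the awsglue library shipped in the container. The helper add_adjusted_amount and the mock rows are hypothetical and stand in for real business logic.

```python
import sys


def add_adjusted_amount(rows, rate):
    """Pure-Python transform on a list of dicts, testable without Spark."""
    return [{**row, "adjusted": round(row["amount"] * rate, 2)} for row in rows]


def run_job(argv):
    # Glue and Spark imports stay inside the function so the helper above
    # can be imported and tested on machines without the Glue libraries.
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    args = getResolvedOptions(argv, ["JOB_NAME"])  # reads --JOB_NAME <value>
    glue_context = GlueContext(SparkContext.getOrCreate())
    spark = glue_context.spark_session
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Mock data instead of an S3 or Data Catalog source
    rows = add_adjusted_amount([{"id": 1, "amount": 200.0}], rate=0.5)
    data = [(r["id"], r["amount"], r["adjusted"]) for r in rows]
    spark.createDataFrame(data, ["id", "amount", "adjusted"]).show()

    job.commit()


# Only start Spark when the script is invoked as a job with --JOB_NAME
if __name__ == "__main__" and "--JOB_NAME" in sys.argv:
    run_job(sys.argv)
```

Keeping the transform separate from the Glue entry point is a design choice: the logic can be unit tested with plain Python, while the Spark-dependent part runs only inside the container.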

A few other options are also available; see the reference below. Keep the following local development restrictions in mind as well.

Local development restrictions

Keep the following restrictions in mind when developing locally.

When using the AWS Glue Scala library, avoid creating an assembly jar (“fat jar” or “uber jar”) with the AWS Glue library, because doing so disables some Glue features; see the reference below for details.

The following are also not supported with local development:

The FindMatches transform.

The vectorized SIMD CSV reader.

The customJdbcDriverS3Path property for loading a JDBC driver from an S3 path. Alternatively, download the JDBC driver to your local machine and load it from there.

Glue Data Quality.

If you’d like to learn more about AWS Glue or other AWS services, feel free to connect with me at huzzefakhan@gmail.com.

References
https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-libraries.html

Written by Huzefa Khan

Passionate Sr. Data Engineer with years of experience in developing and architecting high-class data solutions https://www.linkedin.com/in/huzzefakhan/