AWS Glue is a serverless data integration service that allows you to process and integrate data coming through different data sources at scale. AWS Glue 5.0, the latest version of AWS Glue for Apache Spark jobs, provides a performance-optimized Apache Spark 3.5 runtime experience for batch and stream processing. With AWS Glue 5.0, you get improved performance, enhanced security, support for the next generation of Amazon SageMaker, and more. AWS Glue 5.0 enables you to develop, run, and scale your data integration workloads and get insights faster.
AWS Glue accommodates various development preferences through multiple job creation approaches. For developers who prefer direct coding, Python or Scala development is available using the AWS Glue ETL library.
Building production-ready data platforms requires robust development processes and continuous integration and delivery (CI/CD) pipelines. To support diverse development needs—whether on local machines, Docker containers on Amazon Elastic Compute Cloud (Amazon EC2), or other environments—AWS provides an official AWS Glue Docker image through the Amazon ECR Public Gallery. The image enables developers to work efficiently in their preferred environment while using the AWS Glue ETL library.
In this post, we show how to develop and test AWS Glue 5.0 jobs locally using a Docker container. This post is an updated version of the post Develop and test AWS Glue version 3.0 and 4.0 jobs locally using a Docker container, and uses AWS Glue 5.0.
Available Docker images
The following Docker image is available from the Amazon ECR Public Gallery:

- AWS Glue version 5.0 – public.ecr.aws/glue/aws-glue-libs:5

AWS Glue Docker images are compatible with both x86_64 and arm64.
In this post, we use public.ecr.aws/glue/aws-glue-libs:5 and run the container on a local machine (Mac, Windows, or Linux). This container image has been tested for AWS Glue 5.0 Spark jobs. Among other components, the image contains the AWS Glue ETL library, the Spark 3.5 runtime, and the Iceberg, Hudi, and Delta Lake libraries.
To set up your container, you pull the image from the ECR Public Gallery and then run the container. We demonstrate how to run your container with the following methods, depending on your requirements:
- spark-submit
- REPL shell (pyspark)
- pytest
- Visual Studio Code
Prerequisites
Before you start, make sure that Docker is installed and the Docker daemon is running. For installation instructions, see the Docker documentation for Mac, Windows, or Linux. Also make sure that you have at least 7 GB of disk space for the image on the host running Docker.
Configure AWS credentials
To enable AWS API calls from the container, set up your AWS credentials with the following steps:
- Create an AWS named profile.
- Open cmd on Windows or a terminal on Mac/Linux, and run the following command:
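For example, you can keep the profile name in a shell variable so the docker run commands later in this post can reference it (PROFILE_NAME is an assumed variable name, and profile_name is a placeholder for your own named profile):

```bash
# Assumed variable name; replace profile_name with your AWS named profile.
PROFILE_NAME="profile_name"
```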
In the following sections, we use this AWS named profile.
Pull the image from the ECR Public Gallery
If you’re running Docker on Windows, choose the Docker icon (right-click) and choose Switch to Linux containers before pulling the image.
Run the following command to pull the image from the ECR Public Gallery:
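The image name is the one referenced throughout this post:

```bash
docker pull public.ecr.aws/glue/aws-glue-libs:5
```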
Run the container
Now you can run a container using this image. You can choose any of the following methods based on your requirements.
spark-submit
You can run an AWS Glue job script by running the spark-submit command on the container.
Write your job script (sample.py in the following example) and save it under the /local_path_to_workspace/src/ directory using the following commands:
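For example (WORKSPACE_LOCATION and SCRIPT_FILE_NAME are assumed variable names; adjust the local path and editor to your environment):

```bash
# Assumed variable names; point WORKSPACE_LOCATION at your local workspace.
WORKSPACE_LOCATION=/local_path_to_workspace
SCRIPT_FILE_NAME=sample.py

# Create the src directory and write the job script into it.
mkdir -p ${WORKSPACE_LOCATION}/src
vim ${WORKSPACE_LOCATION}/src/${SCRIPT_FILE_NAME}
```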
These variables are used in the following docker run command. The sample code (sample.py) used in the spark-submit command is included in the appendix at the end of this post.
Run the following command to run the spark-submit command on the container to submit a new Spark application:
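The following is a minimal sketch of that command; it assumes the variables defined earlier, mounts your AWS profiles into the container, and uses glue5_spark_submit as an arbitrary container name:

```bash
# Mount your AWS profiles and the workspace, then submit the job script.
docker run -it --rm \
    -v ~/.aws:/home/hadoop/.aws \
    -v ${WORKSPACE_LOCATION}:/home/hadoop/workspace/ \
    -e AWS_PROFILE=${PROFILE_NAME} \
    --name glue5_spark_submit \
    public.ecr.aws/glue/aws-glue-libs:5 \
    spark-submit /home/hadoop/workspace/src/${SCRIPT_FILE_NAME}
```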
REPL shell (pyspark)
You can run a REPL (read-eval-print loop) shell for interactive development. Run the following command to run the pyspark command on the container to start the REPL shell:
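A minimal sketch of that command, reusing the variables from the previous section (glue5_pyspark is an arbitrary container name):

```bash
docker run -it --rm \
    -v ~/.aws:/home/hadoop/.aws \
    -v ${WORKSPACE_LOCATION}:/home/hadoop/workspace/ \
    -e AWS_PROFILE=${PROFILE_NAME} \
    --name glue5_pyspark \
    public.ecr.aws/glue/aws-glue-libs:5 \
    pyspark
```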
When the shell starts, you will see the PySpark welcome banner followed by the >>> prompt. With this REPL shell, you can code and test interactively.
pytest
For unit testing, you can use pytest for AWS Glue Spark job scripts. Run the following commands for preparation:
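For example (the variable names and the tests/ directory layout are assumptions; test_sample.py is shown in the appendix):

```bash
# Assumed layout: job script under src/, unit tests under tests/.
WORKSPACE_LOCATION=/local_path_to_workspace
UNIT_TEST_FILE_NAME=test_sample.py

mkdir -p ${WORKSPACE_LOCATION}/tests
vim ${WORKSPACE_LOCATION}/tests/${UNIT_TEST_FILE_NAME}
```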
Now let’s invoke pytest using docker run:
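A sketch of that invocation, assuming the image runs the supplied command directly; setting the working directory to the mounted workspace lets pytest discover the tests/ directory:

```bash
docker run -it --rm \
    -v ~/.aws:/home/hadoop/.aws \
    -v ${WORKSPACE_LOCATION}:/home/hadoop/workspace/ \
    --workdir /home/hadoop/workspace \
    -e AWS_PROFILE=${PROFILE_NAME} \
    --name glue5_pytest \
    public.ecr.aws/glue/aws-glue-libs:5 \
    python3 -m pytest
```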
When pytest finishes executing the unit tests, your output will show a summary with the number of tests that passed.
Visual Studio Code
To set up the container with Visual Studio Code, complete the following steps:
- Install Visual Studio Code.
- Install the Python extension.
- Install the Dev Containers extension.
- Open the workspace folder in Visual Studio Code.
- Press Ctrl+Shift+P (Windows/Linux) or Cmd+Shift+P (Mac).
- Enter Preferences: Open Workspace Settings (JSON).
- Press Enter.
- Enter the following JSON and save it:
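The following is a sketch of such workspace settings: the keys are standard VS Code Python extension settings, but the container paths are assumptions and should be adjusted to where Python and the PySpark libraries live inside the image:

```jsonc
{
    // Assumed interpreter path inside the AWS Glue 5.0 container.
    "python.defaultInterpreterPath": "/usr/bin/python3.11",
    // Assumed locations of the PySpark libraries inside the container.
    "python.analysis.extraPaths": [
        "/usr/lib/spark/python/",
        "/usr/lib/spark/python/lib/"
    ]
}
```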
Now you’re ready to set up the container.
- Run the Docker container (a sketch of the docker run command follows at the end of this section).
- Start Visual Studio Code.
- Choose Remote Explorer in the navigation pane.
- Choose the container public.ecr.aws/glue/aws-glue-libs:5 (right-click) and choose Attach in Current Window.
- If the following dialog appears, choose Got it.
- Open /home/hadoop/workspace/.
- Create an AWS Glue PySpark script and choose Run.
You should see a successful run of the AWS Glue PySpark script.
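For the Run the Docker container step, a sketch like the following keeps a container running that the Dev Containers extension can attach to (glue5_pyspark is an arbitrary container name, reused from the REPL section):

```bash
docker run -it --rm \
    -v ~/.aws:/home/hadoop/.aws \
    -v ${WORKSPACE_LOCATION}:/home/hadoop/workspace/ \
    -e AWS_PROFILE=${PROFILE_NAME} \
    --name glue5_pyspark \
    public.ecr.aws/glue/aws-glue-libs:5 \
    pyspark
```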
Changes between the AWS Glue 4.0 and AWS Glue 5.0 Docker images
The following are major changes between the AWS Glue 4.0 and AWS Glue 5.0 Docker images:
- In AWS Glue 5.0, there is a single container image for both batch and streaming jobs. This differs from AWS Glue 4.0, where there was one image for batch and another for streaming.
- In AWS Glue 5.0, the default user name of the container is hadoop. In AWS Glue 4.0, the default user name was glue_user.
- In AWS Glue 5.0, several additional libraries, including JupyterLab and Livy, have been removed from the image. You can manually install them.
- In AWS Glue 5.0, the Iceberg, Hudi, and Delta Lake libraries are all pre-loaded by default, and the environment variable DATALAKE_FORMATS is no longer needed. Up to AWS Glue 4.0, the environment variable DATALAKE_FORMATS was used to specify which table formats to load.
The preceding list is specific to the Docker image. To learn more about AWS Glue 5.0 updates, see Introducing AWS Glue 5.0 for Apache Spark and Migrating AWS Glue for Spark jobs to AWS Glue version 5.0.
Considerations
Keep in mind that some features are not supported when using the AWS Glue container image to develop job scripts locally. For example, the customJdbcDriverS3Path connection option can't be used to import a custom JDBC driver from Amazon S3 (see Appendix B).
Conclusion
In this post, we explored how the AWS Glue 5.0 Docker images provide a flexible foundation for developing and testing AWS Glue job scripts in your preferred environment. These images, readily available in the Amazon ECR Public Gallery, streamline the development process by offering a consistent, portable environment for AWS Glue development.
To learn more about how to build an end-to-end development pipeline, see End-to-end development lifecycle for data engineers to build a data integration pipeline using AWS Glue. We encourage you to explore these capabilities and share your experiences with the AWS community.
Appendix A: AWS Glue job sample codes for testing
This appendix introduces sample AWS Glue job scripts for testing purposes. You can use any of them in the tutorial.
The following sample.py code uses the AWS Glue ETL library with an Amazon Simple Storage Service (Amazon S3) API call. The code requires Amazon S3 permissions in AWS Identity and Access Management (IAM). You need to grant either the AWS managed policy arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess or an IAM custom policy that allows the ListBucket and GetObject API calls for the S3 path.
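The original listing isn't reproduced here; the following is a minimal sketch of such a script, assuming a placeholder S3 path and a read_json helper so the logic can be unit tested:

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Placeholder: point this at an S3 prefix your IAM credentials can
# ListBucket/GetObject against.
S3_INPUT_PATH = "s3://your-bucket/your-prefix/"


def read_json(glue_context, path):
    """Read JSON objects from Amazon S3 into an AWS Glue DynamicFrame."""
    return glue_context.create_dynamic_frame.from_options(
        connection_type="s3",
        connection_options={"paths": [path]},
        format="json",
    )


def main():
    # JOB_NAME is passed by AWS Glue; it is optional when running locally.
    params = ["JOB_NAME"] if "--JOB_NAME" in sys.argv else []
    args = getResolvedOptions(sys.argv, params)

    glue_context = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_context)
    job.init(args.get("JOB_NAME", "sample"), args)

    dyf = read_json(glue_context, S3_INPUT_PATH)
    dyf.printSchema()
    dyf.toDF().show(5)

    job.commit()


if __name__ == "__main__":
    main()
```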
The following test_sample.py code is a sample unit test for sample.py:
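Again as a sketch: it assumes the src/ and tests/ layout used earlier and that pytest is invoked with python3 -m pytest from the workspace root, so the src.sample module is importable:

```python
import pytest
from awsglue.context import GlueContext
from pyspark.context import SparkContext

# Assumes python3 -m pytest is run from the workspace root so that src/ is
# importable as a namespace package.
from src.sample import S3_INPUT_PATH, read_json


@pytest.fixture(scope="module")
def glue_context():
    # One GlueContext shared by the tests in this module.
    yield GlueContext(SparkContext.getOrCreate())


def test_read_json_returns_dynamic_frame(glue_context):
    dyf = read_json(glue_context, S3_INPUT_PATH)
    # The placeholder path must point at readable data for a meaningful count.
    assert dyf.count() >= 0
```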
Appendix B: Adding JDBC drivers and Java libraries
To add a JDBC driver not currently available in the container, you can create a new directory under your workspace with the JAR files you need and mount the directory to /opt/spark/jars/ in the docker run command. JAR files found under /opt/spark/jars/ within the container are automatically added to the Spark classpath and are available for use during the job run.
For example, you can use the following docker run command to add JDBC driver JARs to a PySpark REPL shell:
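A sketch of that command, where jdbc_jars/ is an assumed directory under your workspace that holds the driver JARs:

```bash
docker run -it --rm \
    -v ~/.aws:/home/hadoop/.aws \
    -v ${WORKSPACE_LOCATION}:/home/hadoop/workspace/ \
    -v ${WORKSPACE_LOCATION}/jdbc_jars/:/opt/spark/jars/ \
    -e AWS_PROFILE=${PROFILE_NAME} \
    --name glue5_pyspark_jdbc \
    public.ecr.aws/glue/aws-glue-libs:5 \
    pyspark
```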
As highlighted earlier, the customJdbcDriverS3Path connection option can’t be used to import a custom JDBC driver from Amazon S3 in AWS Glue container images.
Appendix C: Adding Livy and JupyterLab
The AWS Glue 5.0 container image doesn’t have Livy installed by default. You can create a new container image extending the AWS Glue 5.0 container image as the base. The following Dockerfile demonstrates how you can extend the Docker image to include additional components you need to enhance your development and testing experience.
To get started, create a directory on your workstation and place the Dockerfile.livy_jupyter file in it. The following code is Dockerfile.livy_jupyter:
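The original Dockerfile isn't reproduced here; the following is a sketch that extends the AWS Glue 5.0 image, installs JupyterLab with pip, and unpacks an Apache Livy binary release. LIVY_DOWNLOAD_URL is a placeholder build argument, and the Livy configuration itself (for example, livy.conf) is omitted:

```dockerfile
FROM public.ecr.aws/glue/aws-glue-libs:5

# Placeholder: pass a Livy release that matches the Spark and Scala versions in the image.
ARG LIVY_DOWNLOAD_URL

USER root

# JupyterLab for notebook-based development inside the container.
RUN pip3 install --no-cache-dir jupyterlab

# Fetch and unpack Livy under /opt/ (Python's zipfile module is used so that no
# extra archive tools are assumed to be present in the base image).
ADD ${LIVY_DOWNLOAD_URL} /tmp/livy.zip
RUN python3 -m zipfile -e /tmp/livy.zip /opt/ && rm /tmp/livy.zip

# Return to the image's default user.
USER hadoop
```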
Run the docker build command to build the image:
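For example, assuming the sketch above and glue5-livy-jupyter as an arbitrary image tag:

```bash
# Run from the directory that contains Dockerfile.livy_jupyter.
docker build -t glue5-livy-jupyter \
    --build-arg LIVY_DOWNLOAD_URL=<livy-release-zip-url> \
    -f Dockerfile.livy_jupyter .
```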
When the image build is complete, you can use the following docker run command to start the newly built image:
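A sketch of that command; it publishes JupyterLab's default port 8888 and assumes the image runs the supplied command directly:

```bash
docker run -it --rm \
    -v ~/.aws:/home/hadoop/.aws \
    -v ${WORKSPACE_LOCATION}:/home/hadoop/workspace/ \
    -e AWS_PROFILE=${PROFILE_NAME} \
    -p 8888:8888 \
    --name glue5_jupyter \
    glue5-livy-jupyter \
    jupyter lab --no-browser --ip=0.0.0.0
```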
Appendix D: Adding extra Python libraries
In this section, we discuss adding extra Python libraries and installing Python packages using pip.
Local Python libraries
To add local Python libraries, place them under a directory and assign the path to $EXTRA_PYTHON_PACKAGE_LOCATION:
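A sketch of one way to do this: mount the directory into the container and extend PYTHONPATH inside the container before starting pyspark (the directory and container names are assumptions, and the command assumes the image runs the supplied command directly):

```bash
# Assumed local directory that holds your extra Python libraries.
EXTRA_PYTHON_PACKAGE_LOCATION=/local_path_to_workspace/extra_python_path

docker run -it --rm \
    -v ~/.aws:/home/hadoop/.aws \
    -v ${EXTRA_PYTHON_PACKAGE_LOCATION}:/home/hadoop/workspace/extra_python_path/ \
    -e AWS_PROFILE=${PROFILE_NAME} \
    --name glue5_pyspark_extra_libs \
    public.ecr.aws/glue/aws-glue-libs:5 \
    bash -c 'export PYTHONPATH=/home/hadoop/workspace/extra_python_path/:${PYTHONPATH}; pyspark'
```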
To validate that the path has been added to PYTHONPATH, you can check for its existence in sys.path:
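For example, run the following inside the container's pyspark shell (the path matches the mount point used in the sketch above):

```python
import sys

# Print any sys.path entries that come from the mounted directory.
print([path for path in sys.path if "extra_python_path" in path])
```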
Installing Python packages using pip
To install packages from PyPI (or any other artifact repository) using pip, you can use the following approach:
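A sketch of that approach: install the packages with pip3 when the container starts, then open the REPL (your-package is a placeholder, and the command assumes the image runs the supplied command directly):

```bash
docker run -it --rm \
    -v ~/.aws:/home/hadoop/.aws \
    -e AWS_PROFILE=${PROFILE_NAME} \
    --name glue5_pyspark_pip \
    public.ecr.aws/glue/aws-glue-libs:5 \
    bash -c 'pip3 install --user your-package && pyspark'
```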
About the Authors
Subramanya Vajiraya is a Sr. Cloud Engineer (ETL) at AWS Sydney specialized in AWS Glue. He is passionate about helping customers solve issues related to their ETL workload and implementing scalable data processing and analytics pipelines on AWS. Outside of work, he enjoys going on bike rides and taking long walks with his dog Ollie.
Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He is based in Tokyo, Japan. He is responsible for building software artifacts to help customers. In his spare time, he enjoys cycling with his road bike.