AWS Glue + Apache Iceberg
Bringing ACID operations to AWS Glue and SparkSQL

Motivation
At Clairvoyant, we work with a large number of customers that use AWS Glue for their daily ETL processes. Many of these Glue jobs leverage SparkSQL statements to make transformations easier to understand and more readable.
We’ve been looking for ways to make these SparkSQL operations even simpler, mainly by providing ACID operations (UPSERTs, INSERTs, UPDATEs, and DELETEs) in SparkSQL. For example, performing a DELETE is easier to understand and execute than the INSERT OVERWRITE back into a table that you would typically do in Spark. There is a table format that provides just that: Apache Iceberg.
This can also potentially help reduce AWS S3 costs and improve storage efficiency, since Iceberg only needs to store delta data and metadata.
This post will describe how you can configure your AWS Glue job to use Iceberg in SparkSQL through some simple examples.
Glue Job Configurations
Iceberg JARs
First, we will need the proper JARs to be loaded into S3 for use in the Glue job. The following JARs should be downloaded and uploaded to an S3 bucket:
- org.apache.iceberg:iceberg-spark3-runtime:0.13.1
- software.amazon.awssdk:bundle:2.15.40
- software.amazon.awssdk:url-connection-client:2.15.40
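To locate the actual JAR files behind these `group:artifact:version` coordinates, you can map them onto the standard Maven Central repository layout. The sketch below does exactly that; the coordinates are the ones listed above, and the URL pattern follows the standard Maven repository convention.

```python
# Map a group:artifact:version Maven coordinate onto the Maven Central
# repository layout to get a direct download URL for each JAR.
coordinates = [
    "org.apache.iceberg:iceberg-spark3-runtime:0.13.1",
    "software.amazon.awssdk:bundle:2.15.40",
    "software.amazon.awssdk:url-connection-client:2.15.40",
]

def maven_url(coordinate: str) -> str:
    """Build a Maven Central URL: dots in the group ID become path segments."""
    group, artifact, version = coordinate.split(":")
    return (
        "https://repo1.maven.org/maven2/"
        f"{group.replace('.', '/')}/{artifact}/{version}/{artifact}-{version}.jar"
    )

for url in map(maven_url, coordinates):
    print(url)
```

Each printed URL can then be downloaded (with `curl` or a browser) and uploaded to your S3 bucket with `aws s3 cp`.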
Glue Job Details
- Ensure the Glue Version is set to “3.0” — this ensures we’re using Spark 3.1, which supports the SparkSQL keywords that enable ACID operations such as MERGE, DELETE, and UPDATE (these are not available in Spark <= 2.4).
- Update both the “Python library path” and “Dependent JARs path” to include a comma-separated list of S3 paths to the required JARs.
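As a quick sketch of what that field value looks like, the snippet below joins the uploaded JAR locations into the single comma-separated string the Glue console expects. The bucket name and prefix (`my-glue-assets/jars/`) are hypothetical placeholders; substitute your own.

```python
# Build the comma-separated value for the "Dependent JARs path" field.
# Bucket name and key prefix are illustrative, not real resources.
jar_s3_paths = [
    "s3://my-glue-assets/jars/iceberg-spark3-runtime-0.13.1.jar",
    "s3://my-glue-assets/jars/bundle-2.15.40.jar",
    "s3://my-glue-assets/jars/url-connection-client-2.15.40.jar",
]
dependent_jars_path = ",".join(jar_s3_paths)
print(dependent_jars_path)
```

Note that the value must be a single string with no spaces around the commas.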