AWS Glue + Apache Iceberg
Bringing ACID operations to AWS Glue and SparkSQL

Motivation
At Clairvoyant, we work with a large number of customers that use AWS Glue for their daily ETL processes. Many of these Glue jobs leverage SparkSQL statements to make transformations easier to understand and more readable.
We’ve been looking for ways to make these SparkSQL operations even simpler, mainly by providing ACID operations (UPSERTs, INSERTs, UPDATEs, and DELETEs) in SparkSQL. For example, performing a DELETE is easier to understand and execute than the INSERT OVERWRITE back into a table that you would typically do in Spark. There is a table format that provides just that: Apache Iceberg.
This can also potentially help reduce AWS S3 costs and improve storage efficiency, since Iceberg only needs to store delta data and metadata.
This post will describe how you can configure your AWS Glue job to use Iceberg in SparkSQL through some simple examples.
Glue Job Configurations
Iceberg JARs
First, we will need the proper JARs to be loaded into S3 for use in the Glue job. The following JARs should be downloaded and uploaded to an S3 bucket:
- org.apache.iceberg:iceberg-spark3-runtime:0.13.1
- software.amazon.awssdk:bundle:2.15.40
- software.amazon.awssdk:url-connection-client:2.15.40
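To locate the actual JAR files behind these `group:artifact:version` coordinates, you can map them onto the standard Maven Central repository layout. The sketch below does exactly that; the coordinates are the ones listed above, and the URL pattern follows the standard Maven repository convention.

```python
# Map a group:artifact:version Maven coordinate onto the Maven Central
# repository layout to get a direct download URL for each JAR.
coordinates = [
    "org.apache.iceberg:iceberg-spark3-runtime:0.13.1",
    "software.amazon.awssdk:bundle:2.15.40",
    "software.amazon.awssdk:url-connection-client:2.15.40",
]

def maven_url(coordinate: str) -> str:
    """Build a Maven Central URL: dots in the group ID become path segments."""
    group, artifact, version = coordinate.split(":")
    return (
        "https://repo1.maven.org/maven2/"
        f"{group.replace('.', '/')}/{artifact}/{version}/{artifact}-{version}.jar"
    )

for url in map(maven_url, coordinates):
    print(url)
```

Each printed URL can then be downloaded (with `curl` or a browser) and uploaded to your S3 bucket with `aws s3 cp`.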
Glue Job Details
- Ensure the Glue Version is set to “3.0” — this ensures we’re using Spark 3.1, which supports the SparkSQL keywords that enable ACID operations such as MERGE, DELETE, and UPDATE (these are not available in Spark <= 2.4).
- Update both the “Python library path” and “Dependent JARs path” to include a comma-separated list of S3 paths to the required JARs.
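As a quick sketch of what that field value looks like, the snippet below joins the uploaded JAR locations into the single comma-separated string the Glue console expects. The bucket name and prefix (`my-glue-assets/jars/`) are hypothetical placeholders; substitute your own.

```python
# Build the comma-separated value for the "Dependent JARs path" field.
# Bucket name and key prefix are illustrative, not real resources.
jar_s3_paths = [
    "s3://my-glue-assets/jars/iceberg-spark3-runtime-0.13.1.jar",
    "s3://my-glue-assets/jars/bundle-2.15.40.jar",
    "s3://my-glue-assets/jars/url-connection-client-2.15.40.jar",
]
dependent_jars_path = ",".join(jar_s3_paths)
print(dependent_jars_path)
```

Note that the value must be a single string with no spaces around the commas.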