Serverless ETL and Analytics with AWS Glue — Book Review

Anand Prakash
Dec 19, 2022

I have personally worked with two of the book's authors, Vishal Pathak and Noritaka Sekiyama, on the AWS blog post "Ingest streaming data to Apache Hudi tables using AWS Glue and Apache Hudi DeltaStreamer", and it was a great experience. So getting the opportunity to review their book was very exciting. Special thanks to Nivedita Singh and PacktPub for this opportunity.

Chapter 1 — Data Management — Introduction and Concepts

This chapter introduces different data management concepts, tools, and services. It starts by discussing OLTP and OLAP and their differences. It then moves on to traditional data management systems (data warehouses and data marts) and then to more modern systems (data lakes, data lakehouses, and data mesh). The chapter also introduces the Apache Spark framework and AWS Glue, which can be used to execute Spark workloads.

Chapter 2 — Introduction to Important AWS Glue Features

Chapter 2 gives a good introduction to various AWS Glue features and microservices. It discusses in depth the AWS Glue Data Catalog, AWS Glue crawlers, AWS Glue Schema Registry, Glue development endpoints, Glue interactive sessions, and triggers. It also covers key features of various Glue microservices: data sampling, incremental crawling, GlueContext, DynamicFrame, job bookmarks, and the GlueParquet format. The DynamicFrame section explains the difference between a Glue DynamicFrame and a Spark DataFrame, and why a Spark DataFrame performs better than a Glue DynamicFrame for some aggregation operations.

Chapter 3 — Data Ingestion

Chapter 3 of the book explores Glue ETL jobs, Schema Registry, and custom/marketplace connectors. It covers how you can leverage AWS Glue ETL for data ingestion from file/object stores, JDBC stores, streaming data sources such as Amazon Kinesis and Apache Kafka, and SaaS data stores. The authors also discuss AWS Glue-specific ETL transformations and extensions for data ingestion from Amazon S3: job bookmarks, grouping, S3ListImplementation, and bounded execution. The section on data ingestion from JDBC sources covers OOM error scenarios and explores how to implement parallel JDBC reads from AWS Glue using DynamicFrame.
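To give a flavor of the parallel JDBC read idea, here is a minimal sketch (not taken from the book). It assumes Glue's documented `hashfield` and `hashpartitions` connection options; the helper name `parallel_jdbc_options` and the column name `customer_id` are hypothetical, and the Glue call itself is shown only as a comment because it needs a Glue job environment to run.

```python
from typing import Dict


def parallel_jdbc_options(hash_field: str, partitions: int) -> Dict[str, str]:
    """Build options that ask Glue to split a JDBC read into parallel queries.

    hash_field: column whose values are hashed to divide the table.
    partitions: number of parallel reads to issue against the database.
    (Helper name and arguments are illustrative, not from the book.)
    """
    if partitions < 1:
        raise ValueError("partitions must be at least 1")
    # Glue expects these option values as strings.
    return {"hashfield": hash_field, "hashpartitions": str(partitions)}


options = parallel_jdbc_options("customer_id", 8)

# Inside a Glue job, the options would be passed roughly like this
# (database and table names are placeholders):
#
# dyf = glueContext.create_dynamic_frame.from_catalog(
#     database="my_database",
#     table_name="my_table",
#     additional_options=options,
# )
```

Splitting the read this way spreads rows across several executors instead of pulling the whole table through a single connection, which is one way to avoid the OOM scenario the chapter describes.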

Chapter 4 — Data Preparation

This chapter talks about AWS Glue DataBrew, a visual data preparation tool, and AWS Glue Studio, a new visual interface for authoring Glue ETL jobs. It covers some of the most commonly used transformations in AWS Glue ETL: ApplyMapping, Relationalize, Join, RenameField, Unbox, and ErrorsAsDynamicFrame.

Chapter 5 — Data Layouts

Chapter 5 of the book discusses in depth how to design data layouts (store data optimally, manage the number and size of files, and optimize storage on Amazon S3) to maximize query performance. It goes into detail on compression, splittable and unsplittable files, partitioning techniques, Glue partition indexes, bucketing, and Amazon S3 Lifecycle management. It covers partition pruning in AWS Glue and the support for both client-side and server-side pushdown for data filtering using Glue DynamicFrame. The chapter also discusses the functionality AWS Glue provides to exclude S3 storage classes, transition storage classes, and purge objects in ETL jobs.
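The idea behind partition pruning can be illustrated with a small, self-contained sketch. This is not Glue code and not from the book: it simply filters Hive-style object keys by their partition values, which is conceptually what a pushdown predicate lets the engine do before any data is read. The keys and function names are made up for illustration.

```python
from typing import Callable, Dict, List


def parse_partitions(key: str) -> Dict[str, str]:
    """Extract Hive-style partition values (name=value) from an object key."""
    parts: Dict[str, str] = {}
    for segment in key.split("/"):
        if "=" in segment:
            name, value = segment.split("=", 1)
            parts[name] = value
    return parts


def prune(keys: List[str], predicate: Callable[[Dict[str, str]], bool]) -> List[str]:
    """Keep only the keys whose partition values satisfy the predicate."""
    return [key for key in keys if predicate(parse_partitions(key))]


# Hypothetical object keys laid out as year=/month= partitions.
keys = [
    "sales/year=2021/month=12/part-0000.parquet",
    "sales/year=2022/month=01/part-0000.parquet",
    "sales/year=2022/month=02/part-0000.parquet",
]

# Only the February 2022 partition survives; the other files are never opened.
selected = prune(keys, lambda p: p["year"] == "2022" and p["month"] == "02")
```

In a Glue job, the equivalent filter would typically be expressed as a predicate string passed when creating the DynamicFrame, so that non-matching partitions are skipped rather than read and discarded.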

Chapter 6 — Data Management

As the title says, this chapter is all about managing data. It covers techniques such as casting data types, mapping column names, and flattening nested schemas to normalize data in an AWS Glue ETL job. The chapter also discusses techniques for handling error records and duplicate records, de-normalizing tables, and masking and hashing data in a Glue ETL job. Toward the end of the chapter, you will learn how to manage data quality using AWS Glue DataBrew data quality rules and Deequ.
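As a rough illustration of the difference between hashing and masking (a plain-Python sketch, not the book's Glue code), the example below hashes one field with SHA-256 and masks all but the last four characters of another. The record, the salt value, and the function names are all hypothetical.

```python
import hashlib


def hash_value(value: str, salt: str = "") -> str:
    """Return a SHA-256 hex digest; the same input always maps to the same output."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()


def mask_value(value: str, visible: int = 4, mask_char: str = "*") -> str:
    """Replace all but the last `visible` characters with `mask_char`."""
    if visible <= 0 or len(value) <= visible:
        # Nothing can safely be left visible, so mask the whole value.
        return mask_char * len(value)
    return mask_char * (len(value) - visible) + value[-visible:]


# Hypothetical record containing sensitive fields.
record = {"name": "Jane Doe", "card_number": "4111111111111111"}

safe_record = {
    # Hashing keeps the column usable for joins and deduplication.
    "name": hash_value(record["name"], salt="demo-salt"),
    # Masking keeps a human-readable hint of the original value.
    "card_number": mask_value(record["card_number"]),
}
```

Hashing is the better fit when the column must still act as a join key, while masking suits values that people need to recognize without seeing them in full.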

Chapter 7 — Metadata Management

This chapter takes an in-depth look at AWS Glue crawlers: crawler behavior, lifecycle, configuration, custom classifiers, scheduling, automation, incremental crawling, and S3 event-based crawling. The chapter also touches on Lake Formation governed tables and data lineage using Glue DataBrew.

Chapter 8 — Data Security

Security is “job zero” and this whole chapter is dedicated to approaches and configurations that ensure the security of the data lake and data pipelines. The chapter talks in depth about resource-based and identity-based policies for the AWS Glue catalog, then moves on to tag-based access control for various resources in Glue. It covers Lake Formation tag-based access control and encryption, both at rest and in transit. You will also learn how networking in Glue works.

Chapter 9 — Data Sharing

The chapter starts by discussing three different strategies for sharing data: single tenant, hub and spoke, and data mesh. You will learn in depth, with examples, how to share data with multiple AWS accounts using S3 bucket policies and Glue Catalog policies. If you haven’t used AWS Lake Formation and want to learn how to use its tag-based access control to share data between AWS accounts, then this is the go-to chapter. It is explained very well, with all the code snippets and screenshots you need.

Chapter 10 — Data Pipeline Management

In chapter 10, you will learn what data pipelines are, how to select data processing services for your use case, how to orchestrate a pipeline with workflow tools, and how to automate data pipeline provisioning using tools such as CloudFormation and AWS Glue Blueprints. The chapter discusses various AWS services (AWS Lambda, AWS Glue, AWS Batch, Amazon EMR, Amazon ECS, and Amazon Athena), which one to select for data processing, and which one shines for which use case. It also covers the orchestration services AWS Glue workflows, AWS Step Functions, and Amazon Managed Workflows for Apache Airflow in detail.

Chapter 11 — Monitoring

This is the shortest but one of the most important chapters of the book. It lays the groundwork for monitoring and troubleshooting. The chapter talks about overall monitoring of the data platform and lists the key things to monitor for an AWS Glue ETL job: overall statistics, state changes, performance, failures, and so on.

Chapter 12 — Tuning, Debugging, and Troubleshooting

If you are running any AWS Glue workloads, then this is a must-read chapter. You will learn to tune AWS Glue crawlers, tune the performance of AWS Glue Spark ETL jobs, and troubleshoot and debug AWS Glue ETL jobs. The chapter covers common issues with AWS Glue jobs, such as OOM errors, disk space issues, too many tasks, and Amazon S3 slow down errors, along with their resolutions.

Chapter 13 — Data Analysis

Here you will learn about various tools and services used for data analysis. The chapter includes a CloudFormation template that creates 12 Glue jobs, an Amazon Redshift cluster, an Amazon MSK cluster, and an OpenSearch domain to demonstrate how AWS Glue features and resources integrate with services that help in data analysis. It also covers creating and updating Hudi tables and Delta Lake tables with Glue, and inserting data into Lake Formation’s governed tables. It is a heavy chapter but super useful.

Chapter 14 — Machine Learning Integration

Chapter 14 is again a short chapter, discussing the Glue ML transform FindMatches, which helps find duplicate records within a dataset automatically. It covers how to create and use this ML transform. It also provides an overview of Glue integration with Amazon SageMaker.

Chapter 15 — Architecting Data Lakes for Real-World Scenario and Edge Cases

This is a very interesting chapter that covers very practical examples of real-world scenarios, for example, solving join problems involving big fact and dimension tables using AWS Glue. It provides lots of tips and tricks that can help you deal with some of the performance problems you may be facing in your AWS Glue jobs. In my opinion, this is another must-read chapter.

Conclusion

Serverless ETL and Analytics with AWS Glue is one of the most comprehensive books on AWS Glue that I have found. The book has 15 chapters and is over 400 pages, and all the content is very relevant. Whether you are a novice or an experienced AWS Glue user, there is a lot to learn from this book. It has lots of code snippets, examples, and references to help you understand the concepts. Another great aspect of the book is that it not only covers the AWS Glue service, but also talks in detail about AWS Glue DataBrew, AWS Lake Formation, ML use cases, and some really good, practical, real-world scenarios for tuning, troubleshooting, and maximizing the performance of ETL jobs. It was super fun reading it. I have personally learned new things about the AWS Glue service from this book. To conclude, I highly recommend reading this book!

Why wait, go get it — http://packt.link/CH3PC

If you have enjoyed reading this blogpost and it has helped you then you can buy me a coffee to help me write more and share more! :)

The viewpoints expressed are mine and not of my employer. The views, opinions and comments expressed by visitors are theirs alone and may not reflect my opinion. If you enjoyed reading this article feel free to connect with me on LinkedIn. If you’re new to Medium, sign up using my Membership Referral.

Anand Prakash

Avid learner of technology solutions around Machine Learning, Big-Data, Databases. 5x AWS Certified | 5x Oracle Certified.