Scan a DynamoDB table from AWS Glue in a different account

In this blog post I will walk through the steps required to set up an AWS Glue job that scans a DynamoDB table in another account. In my setup, I scan the DynamoDB table in Account A (us-west-2), perform Glue transformations in Account B (us-east-1), and write the results to S3 in Account B.

Account A — DynamoDB table
Account B — AWS Glue Job, S3 Bucket

1. Create an IAM role in Account B (us-east-1) with Glue as the trusted entity, attach the AWSGlueServiceRole managed policy, and attach another policy allowing the required sts and s3 actions (shown below). Make note of the role ARN.

arn:aws:iam::xxxxxxxxxxxx:role/ddb-to-s3-glue-role

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "sts:AssumeRole",
                "sts:GetAccessKeyInfo",
                "sts:GetSessionToken",
                "sts:TagSession"
            ],
            "Effect": "Allow",
            "Resource": "*"
        },
        {
            "Action": [
                "s3:DeleteObject",
                "s3:GetBucketLocation",
                "s3:GetObject",
                "s3:ListBucket",
                "s3:ListBucketMultipartUploads",
                "s3:PutObject"
            ],
            "Effect": "Allow",
            "Resource": [
                "arn:aws:s3:::your-bucket-name",
                "arn:aws:s3:::your-bucket-name/*"
            ]
        }
    ]
}
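
If you prefer scripting the role creation over using the console, a minimal boto3 sketch could look like the following. It assumes the policy JSON above has been saved locally as glue-role-policy.json (the file name and inline policy name are placeholders):

import json
import boto3

iam = boto3.client('iam')

# Trust policy letting the Glue service assume the role
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "glue.amazonaws.com"},
        "Action": "sts:AssumeRole"
    }]
}

iam.create_role(
    RoleName='ddb-to-s3-glue-role',
    AssumeRolePolicyDocument=json.dumps(trust_policy)
)

# Attach the AWS-managed Glue service policy
iam.attach_role_policy(
    RoleName='ddb-to-s3-glue-role',
    PolicyArn='arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole'
)

# Attach the sts/s3 policy shown above as an inline policy
with open('glue-role-policy.json') as fh:
    iam.put_role_policy(
        RoleName='ddb-to-s3-glue-role',
        PolicyName='sts-and-s3-access',
        PolicyDocument=fh.read()
    )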

2. Create an IAM role in Account A (us-west-2), where the DynamoDB table resides, that allows the "Scan" action.

arn:aws:iam::xxxxxxxxxxxx:role/scan-ddb-role

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "dynamodb:DescribeTable",
                "dynamodb:Scan"
            ],
            "Resource": [
                "arn:aws:dynamodb:us-west-2:xxxxxxxxxxxx:table/service-statement"
            ]
        }
    ]
}

Make sure to modify the trust relationship of this role so that the role created in step 1 is allowed to assume it, and add any access conditions the role needs.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::xxxxxxxxxxxx:role/ddb-to-s3-glue-role"
            },
            "Action": "sts:AssumeRole",
            "Condition": {}
        }
    ]
}
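
Before wiring this into Glue, it is worth sanity-checking the trust relationship. A quick boto3 sketch, run with credentials from Account B (the session name is arbitrary):

import boto3

sts = boto3.client('sts', region_name='us-west-2')

# If the trust relationship is set up correctly, this succeeds
# and returns temporary credentials for the Account A role
resp = sts.assume_role(
    RoleArn='arn:aws:iam::xxxxxxxxxxxx:role/scan-ddb-role',
    RoleSessionName='trust-check'
)
print(resp['AssumedRoleUser']['Arn'])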

3. Create an AWS Glue job in Account B (us-east-1) and choose the IAM role created in step 1 as the role assumed by the job. This can be done in the console, or scripted as in the sketch below.
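
A minimal boto3 sketch for creating the job; the job name, script location, and Glue version are assumptions (the script itself comes in step 4):

import boto3

glue = boto3.client('glue', region_name='us-east-1')

# Job name, script path, and Glue version are placeholders
glue.create_job(
    Name='ddb-to-s3-job',
    Role='arn:aws:iam::xxxxxxxxxxxx:role/ddb-to-s3-glue-role',
    Command={
        'Name': 'glueetl',
        'ScriptLocation': 's3://your-bucket-name/scripts/ddb-to-s3.py',
        'PythonVersion': '3'
    },
    GlueVersion='3.0'
)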

4. Create the Glue script to scan the table.

import sys
import json
import boto3
from awsglue.job import Job
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext
import pyspark.sql.types as t
import pyspark.sql.functions as f
from boto3.dynamodb.conditions import Key

# Instantiate the STS client
sts_client = boto3.client('sts', region_name='us-west-2')

# Assume the role created in step 2
assumed_role_object = sts_client.assume_role(
    RoleArn="arn:aws:iam::xxxxxxxxxxxx:role/scan-ddb-role",
    RoleSessionName="AssumeRoleSession1"
)

# Retrieve the temporary credentials
credentials = assumed_role_object['Credentials']

# Instantiate the DynamoDB resource with the assumed-role credentials
dynamodb_resource = boto3.resource(
    'dynamodb',
    aws_access_key_id=credentials['AccessKeyId'],
    aws_secret_access_key=credentials['SecretAccessKey'],
    aws_session_token=credentials['SessionToken'],
    region_name='us-west-2'
)

# Function to scan the DynamoDB table
def scan_table(table_name, filter_key=None, filter_value=None):
    """
    Perform a scan operation on the table.
    Can specify filter_key (column name) and its value to filter on.
    Fetches all pages of results and returns a list of items.
    """
    table = dynamodb_resource.Table(table_name)
    scan_kwargs = {}
    if filter_key and filter_value:
        scan_kwargs['FilterExpression'] = Key(filter_key).eq(filter_value)
    response = table.scan(**scan_kwargs)
    items = response['Items']
    # Paginate until all results have been fetched, preserving any filter
    while response.get('LastEvaluatedKey'):
        response = table.scan(ExclusiveStartKey=response['LastEvaluatedKey'], **scan_kwargs)
        items += response['Items']
    return items

# Scan the table; returns a list of dictionaries
table_items = scan_table(table_name='service-statement')
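
From here, the remaining piece of the setup (Glue transformations and the write to S3 in Account B) can continue directly in the same script. A minimal sketch: the S3 path is a placeholder, and the JSON round-trip with default=str is one simple way to make DynamoDB's Decimal values Spark-friendly.

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

# DynamoDB returns numbers as Decimal; serialize each item to a JSON
# string so Spark can infer a schema when reading it back
items_rdd = sc.parallelize([json.dumps(item, default=str) for item in table_items])
df = spark.read.json(items_rdd)

# Wrap the DataFrame in a DynamicFrame for Glue transformations
dyf = DynamicFrame.fromDF(df, glueContext, "ddb_items")

# Write the result to the S3 bucket in Account B (placeholder path)
glueContext.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={"path": "s3://your-bucket-name/dynamodb-export/"},
    format="json"
)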

Reference -
https://martinapugliese.github.io/interacting-with-a-dynamodb-via-boto3/