I want to start this blog with some of my personal notes before jumping into the book.
2. What is a norm?
A norm is a measure of distance and has three properties
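The three properties are not listed in this excerpt. Assuming the standard definition is meant (non-negativity, with zero only for the zero vector; absolute homogeneity; and the triangle inequality), a minimal sketch that checks them numerically for the Euclidean norm might look like this. The function names are illustrative and do not come from the book.

```python
import math


def l2_norm(v):
    """Euclidean (L2) norm of a vector given as a list of numbers."""
    return math.sqrt(sum(x * x for x in v))


def check_norm_properties(u, v, alpha, tol=1e-9):
    """Check the three standard norm axioms on sample vectors u, v and scalar alpha."""
    scaled = [alpha * x for x in u]
    summed = [a + b for a, b in zip(u, v)]
    return {
        # ||u|| >= 0, and the zero vector has norm 0
        "non_negative": l2_norm(u) >= 0 and l2_norm([0.0] * len(u)) == 0,
        # ||alpha * u|| == |alpha| * ||u||
        "homogeneous": abs(l2_norm(scaled) - abs(alpha) * l2_norm(u)) <= tol,
        # ||u + v|| <= ||u|| + ||v||
        "triangle": l2_norm(summed) <= l2_norm(u) + l2_norm(v) + tol,
    }


result = check_norm_properties([3.0, 4.0], [1.0, -2.0], alpha=-2.5)
print(result)
```

A check on a few sample vectors is only a sanity test, not a proof that a function is a norm.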
For quite some time I have been thinking about writing on data science and machine learning, but somehow it wasn’t happening. Recently, I joined a 25-week fastbook reading session organized by Weights & Biases and led by Aman Arora (an awesome instructor 👏). This is my first time participating in a reading session, and it has been an amazing experience. There is so much participation during and after the lecture that it makes learning fun
In this blog post I discuss how to export a 100 GB non-partitioned table from Aurora PostgreSQL to Amazon S3. I will walk you through two approaches to exporting the data: first using aws_s3, a PostgreSQL extension that Aurora PostgreSQL provides, and then using the AWS Glue service. The post also covers the performance and scaling challenges of exporting the table with AWS Glue.
I was looking for a way to let a user upload a bunch of small (< 10 MB) CSV files to a specific S3 bucket. While exploring the options I came across the AWS Amplify service, and I felt that it makes deploying a full-stack web app easy for someone like me who does not have much knowledge of web app development. …
Data transformation is an important aspect of data engineering and can be a challenging task, depending on the dataset and the transformation requirements. A bug in a data transformation can severely affect the final dataset and lead to data issues. In this blog I am going to share my experience of encountering missing values in a Pandas DataFrame, handling these missing values in Pandas, and converting the Pandas DataFrame to a Spark DataFrame.
To give a quick background, I was writing a data transformation (ETL) job in AWS Glue using PySpark, which was to be executed every 15 minutes. The…
AWS Glue is a serverless ETL service for processing large datasets from various sources for analytics and data processing. Recently I came across a “CSV data source does not support map data type” error in a newly created Glue job. In a nutshell, the job was performing the steps below:
It was during this write step that the Glue job was failing. Let's look into it in a little more detail:
datasource0 = glueContext.create_dynamic_frame_from_options(…
In this blog post I will discuss the following scenarios for connecting to databases from an AWS Lambda function:
In this setup, the Amazon Aurora PostgreSQL database is running in a private subnet with public accessibility set to No. The connectivity and security details are as follows:
Given that you have a partitioned table in the AWS Glue Data Catalog, there are a few ways in which you can update the Glue Data Catalog with newly created partitions.
Recently, the AWS Glue service team added a new feature (a Glue job parameter) that lets you immediately view newly created partitions in the Glue Data Catalog.
To demo this, I will pre-create an empty partitioned table using the Amazon Athena service, with its target location in S3. I have another S3 location which…
AWS Glue is a fully managed extract, transform, and load (ETL) service for processing large datasets from various sources for analytics and data processing. When you add an AWS Glue job, you can choose its type: Spark, Spark Streaming, or Python shell.
For one of my use cases I wanted to try the new “redshift-data” API service in an AWS Glue Python Shell script. The Amazon Redshift Data API can be used to run SQL queries on Amazon Redshift tables. To test the redshift-data API I wrote a simple AWS Glue Python Shell…
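The script itself is not shown in this excerpt. As a rough sketch of how the Data API is typically called, the function below submits a statement, polls until it completes, and returns the first page of records. It assumes `client` behaves like `boto3.client("redshift-data")`; the function name and its parameters are illustrative, not taken from the post.

```python
import time


def run_redshift_sql(client, cluster_id, database, db_user, sql, poll_seconds=1.0):
    """Submit a statement through the Redshift Data API and wait for it to finish.

    `client` is assumed to behave like boto3.client("redshift-data").
    """
    # The Data API is asynchronous: execute_statement only returns a statement Id.
    statement_id = client.execute_statement(
        ClusterIdentifier=cluster_id, Database=database, DbUser=db_user, Sql=sql
    )["Id"]

    # Poll until the statement reaches a terminal state.
    while True:
        desc = client.describe_statement(Id=statement_id)
        status = desc["Status"]
        if status == "FINISHED":
            break
        if status in ("FAILED", "ABORTED"):
            raise RuntimeError(f"Statement {statement_id} {status}: {desc.get('Error')}")
        time.sleep(poll_seconds)

    # Only statements that produce rows have a result set to fetch.
    if desc.get("HasResultSet"):
        return client.get_statement_result(Id=statement_id)["Records"]
    return []
```

In a Glue Python Shell job this could be called as `run_redshift_sql(boto3.client("redshift-data"), ...)` with your own cluster identifier, database, and user. Results are paginated, so a fuller version would follow `NextToken` instead of returning only the first page.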
Every year around AWS re:Invent, AWS releases many new features over the course of a month. In this blog post I will touch on three new features that were introduced for Amazon DynamoDB. DynamoDB is a managed non-relational database with single-digit millisecond performance at any scale.
New Features in Amazon DynamoDB:
Avid learner of technology solutions around databases, big-data, Machine Learning. 5x AWS Certified | 5x Oracle Certified. Connect on Twitter @anandp86