Extracting Data Points as JSON from Handwritten Forms Using AWS Textract, Lambda and S3

Apr 16, 2021

The winter storms in Texas in 2021 prompted our client to come up with a use case: collecting data from users through handwritten paper forms and saving the data points to our existing database tables.

As developers, this was a new challenge and an opportunity to learn and try new technology stacks.

After a good round of research, we decided to use the OCR service provided by AWS, called “Textract”.

While working on it we couldn’t find many resources to guide us through these services, so on behalf of our team I decided to document and publish our journey.

Amazon Textract is a machine learning service that automatically extracts text, handwriting, and data from scanned documents. It goes beyond simple optical character recognition (OCR) to identify, understand, and extract data from forms and tables.

In this post we will be covering the following checkpoints:

1. How to upload an image/document to an S3 bucket

2. How to create a Lambda function that gets triggered when a new image or doc is uploaded to S3

3. How to write Python code in the Lambda function that parses the analyzed text/data from the doc/image and provides a JSON output

4. Creating a new response JSON file whenever a new image/doc is uploaded to S3

Prerequisites

1. You need to have an AWS account and some basic knowledge of AWS services.

2. To use AWS Textract in Python, the latest “boto3” package is needed, which was not available in the AWS Lambda hosted environments at the time of writing. Execute this command: pip install --target ./python boto3

3. You should have created essential credentials such as an Access Key and Secret Key on your account.

4. The following AWS services will be used throughout this guide:

Lambda Service
Textract Service
Simple Storage Service i.e. S3
Identity and Access Management Service i.e. IAM

1. Uploading Image/doc to S3 bucket

Create S3 Bucket

1. Go to the AWS S3 page and click Create Bucket

2. Enter the Bucket name and Region, then click Next

3. Set permissions and click Create Bucket

Note:- The region selected should be the same as the one used to create the Lambda function

Upload image to S3 bucket

For this example, we upload images directly from the AWS S3 console.
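If you would rather script the upload than use the console, a minimal sketch with boto3 could look like the following. The `uploads/` prefix, the file name, and the bucket name are hypothetical placeholders, and the S3 client is passed in so the helper can be tried without AWS credentials.

```python
import os


def build_object_key(local_path, prefix="uploads/"):
    """Return an S3 key made of a prefix plus the file's base name.

    The "uploads/" prefix is a hypothetical choice for illustration only.
    """
    return prefix + os.path.basename(local_path)


def upload_form_image(s3_client, local_path, bucket, prefix="uploads/"):
    """Upload one local image to S3 and return the key it was stored under."""
    key = build_object_key(local_path, prefix)
    # upload_file(Filename, Bucket, Key) is the boto3 S3 client transfer helper.
    s3_client.upload_file(local_path, bucket, key)
    return key


# Example call (needs AWS credentials and an existing bucket):
#   import boto3
#   upload_form_image(boto3.client("s3"), "scanned_form.jpg", "my-forms-bucket")
```

Keeping the .jpg extension on the key matters here, because the Lambda trigger configured later filters on that suffix.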

2. Extracting data from an S3 image

This process consists of a Lambda function that gets triggered whenever an image (.jpg extension) gets uploaded to the S3 bucket.

Follow the steps below

Creating the S3 Lambda Trigger

Steps to create a Lambda function that gets triggered whenever an image is uploaded

1. Go to the AWS Lambda page and click Create function

2. Select “Use a Blueprint”, search for the “s3-get-object-python” template, and click “Configure”

3. Enter the function name and role name

4. To create the S3 trigger, under the S3 trigger tab:

Add the bucket name
Select “All object create events” for Event type
Set the Suffix to .jpg

5. Click create function

Note:- We can add more than one trigger (for example, for PNG or PDF files) based on our needs
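When the trigger fires, Lambda receives an S3 event notification that names the bucket and the object key. As a rough sketch of how a function might read it, the helper below (the name `uploaded_objects` is our own, not part of any AWS library) pulls out the bucket/key pairs and re-checks the suffix.

```python
from urllib.parse import unquote_plus


def uploaded_objects(event, suffixes=(".jpg",)):
    """Return (bucket, key) pairs for uploaded objects whose key ends with a wanted suffix."""
    found = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Keys arrive URL-encoded in S3 event notifications (a space becomes '+').
        key = unquote_plus(record["s3"]["object"]["key"])
        # Compare in lower case so FORM.JPG is accepted by this check as well.
        if key.lower().endswith(tuple(suffixes)):
            found.append((bucket, key))
    return found
```

Passing `suffixes=(".jpg", ".png", ".pdf")` would cover the extra triggers mentioned in the note.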

Configuring the roles created

1. Once the Lambda function is created, go to the Configuration tab on the Lambda function page

2. Choose the Permissions tab, where we can see the Role that we have created

3. Click on the Role name; this will open the role’s summary page

4. Click Attach policies and add “AmazonTextractFullAccess”; this gives the Lambda function access to the AWS Textract service
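The same policy attachment can also be scripted. The sketch below assumes the AWS-managed policy ARN follows the usual `arn:aws:iam::aws:policy/<name>` pattern and takes the IAM client as an argument; the role name in the example call is a hypothetical placeholder for whatever you entered when creating the function.

```python
# ARN of the AWS-managed policy named in step 4 (assumed standard managed-policy format).
TEXTRACT_POLICY_ARN = "arn:aws:iam::aws:policy/AmazonTextractFullAccess"


def attach_textract_policy(iam_client, role_name):
    """Attach the Textract managed policy to the Lambda execution role."""
    iam_client.attach_role_policy(RoleName=role_name, PolicyArn=TEXTRACT_POLICY_ARN)
    return TEXTRACT_POLICY_ARN


# Example call (needs IAM permissions; the role name is a placeholder):
#   import boto3
#   attach_textract_policy(boto3.client("iam"), "my-textract-lambda-role")
```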

Add Layers to Lambda function

Step 1- Creating the boto3 zip

1. On the local machine, create a directory for the project

mkdir -p zip_boto3/python

2. Install the boto3 package into the zip_boto3/python directory

pip install boto3 -t zip_boto3/python

3. Zip the contents of zip_boto3

cd zip_boto3
zip -r boto3-layer.zip python
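If the zip command is not available on your machine, the archive can be built with Python's standard library instead. This is a small sketch under the same directory layout as above; the function name `build_layer_zip` is our own.

```python
import shutil


def build_layer_zip(project_dir="zip_boto3", archive_base="boto3-layer"):
    """Zip the 'python' folder inside project_dir and return the archive path."""
    # base_dir="python" keeps the top-level 'python/' folder inside the archive,
    # which is where Lambda looks for Python packages in a layer.
    return shutil.make_archive(
        archive_base, "zip", root_dir=project_dir, base_dir="python"
    )
```

Calling `build_layer_zip()` from the directory that contains zip_boto3 writes boto3-layer.zip next to it.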

Step 2- Creating layer in AWS

1. Log in to the AWS console and go to Lambda

2. Select Layers and click Create layer

3. Fill in the details for the layer, upload boto3-layer.zip, and select Python 3.7 as the compatible runtime

Step 3- Using the layer

1. Select the Lambda function that we have created

2. Click on Layers and then Add a layer

3. Pick the layer that we have created and save the Lambda function

3. Add code to Lambda Function

Create two files in your function folder

lambda_function.py

This code extracts the data from the uploaded image and saves it as a JSON file
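A minimal sketch of what such a handler could look like is shown here. It is an illustration under stated assumptions rather than the exact deployed code: it stores the raw Textract blocks as JSON next to the image, and the helper names `output_key_for` and `process_upload` are our own. The clients are passed in so the logic can be exercised locally with stand-ins.

```python
import json
import posixpath
from urllib.parse import unquote_plus


def output_key_for(image_key):
    """Map 'folder/form.jpg' to 'folder/form.json'."""
    stem, _ext = posixpath.splitext(image_key)
    return stem + ".json"


def process_upload(event, textract_client, s3_client):
    """Analyze the uploaded image with Textract and save the result as JSON."""
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = unquote_plus(record["s3"]["object"]["key"])

    # Ask Textract to analyze the image where it already sits in S3.
    response = textract_client.analyze_document(
        Document={"S3Object": {"Bucket": bucket, "Name": key}},
        FeatureTypes=["FORMS", "TABLES"],
    )

    # Assumption: we store the raw blocks; a parsing step could go here instead.
    out_key = output_key_for(key)
    s3_client.put_object(
        Bucket=bucket,
        Key=out_key,
        Body=json.dumps(response["Blocks"]).encode("utf-8"),
    )
    return out_key


def lambda_handler(event, context):
    import boto3  # supplied by the boto3 layer attached to the function

    return process_upload(event, boto3.client("textract"), boto3.client("s3"))
```

Because the output ends in .json and the trigger filters on .jpg, writing the result to the same bucket does not re-trigger the function.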

Note:- The code in lambda_function.py captures the table data from the uploaded file and returns a JSON file with the table contents keyed by column name
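To give an idea of how table contents can be keyed by column name, here is a hedged sketch that walks the Textract blocks directly. It assumes the first row of each table holds the column names, and the function names are our own rather than anything from trp.py.

```python
def _cell_text(block, by_id):
    """Join the WORD children of a block into a single string."""
    words = []
    for rel in block.get("Relationships", []):
        if rel["Type"] == "CHILD":
            for child_id in rel["Ids"]:
                child = by_id[child_id]
                if child["BlockType"] == "WORD":
                    words.append(child["Text"])
    return " ".join(words)


def tables_as_records(blocks):
    """Return one list of row dicts per TABLE block, keyed by the header row."""
    by_id = {b["Id"]: b for b in blocks}
    tables = []
    for table in (b for b in blocks if b["BlockType"] == "TABLE"):
        # Collect cell text by its 1-based (row, column) position.
        grid = {}
        for rel in table.get("Relationships", []):
            if rel["Type"] != "CHILD":
                continue
            for cell_id in rel["Ids"]:
                cell = by_id[cell_id]
                if cell["BlockType"] == "CELL":
                    grid[(cell["RowIndex"], cell["ColumnIndex"])] = _cell_text(cell, by_id)
        n_rows = max((r for r, _ in grid), default=0)
        n_cols = max((c for _, c in grid), default=0)
        # Assumption: row 1 holds the column names.
        header = [grid.get((1, c), "") for c in range(1, n_cols + 1)]
        rows = [
            {header[c - 1]: grid.get((r, c), "") for c in range(1, n_cols + 1)}
            for r in range(2, n_rows + 1)
        ]
        tables.append(rows)
    return tables
```

The returned structure can be passed straight to `json.dumps` before it is written back to S3.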

trp.py

trp.py helps parse the response that we get from AWS Textract
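To show the kind of linking such a parser has to do, the sketch below pairs form keys with their values by following the relationships between Textract blocks. It does not use trp.py itself; the function names are our own, and checkbox values (selection elements) are ignored for brevity.

```python
def _child_text(block, by_id):
    """Join the WORD children of a block into a single string."""
    words = []
    for rel in block.get("Relationships", []):
        if rel["Type"] == "CHILD":
            for child_id in rel["Ids"]:
                child = by_id[child_id]
                if child["BlockType"] == "WORD":
                    words.append(child["Text"])
    return " ".join(words)


def form_fields(blocks):
    """Return {key text: value text} for every KEY block in a Textract response."""
    by_id = {b["Id"]: b for b in blocks}
    fields = {}
    for block in blocks:
        is_key = (
            block["BlockType"] == "KEY_VALUE_SET"
            and "KEY" in block.get("EntityTypes", [])
        )
        if not is_key:
            continue
        key_text = _child_text(block, by_id)
        value_text = ""
        # A KEY block points at its VALUE block(s) through a "VALUE" relationship.
        for rel in block.get("Relationships", []):
            if rel["Type"] == "VALUE":
                value_text = " ".join(_child_text(by_id[i], by_id) for i in rel["Ids"])
        fields[key_text] = value_text
    return fields
```

Calling `form_fields(response["Blocks"])` on an analyze_document response yields a plain dictionary that is easy to map onto database columns.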

Save and test your Lambda function

4. Result

Whenever an object with a .jpg extension is uploaded to the S3 bucket, the associated Lambda function is triggered and an output file with a .json extension is written back to the S3 bucket.

5. References

Automatically extract text and structured data from documents with Amazon Textract | Amazon Web… (aws.amazon.com)

Working with object metadata (docs.aws.amazon.com)

Textract - Boto3 Docs 1.17.53 documentation (boto3.amazonaws.com)

amazon-textract-response-parser (pypi.org)

Textract - Boto3 Docs 1.20.25 documentation (boto3.amazonaws.com)

Hope the content was informative! :)

-Noufal Rijal
