Extracting Data Points as JSON from Handwritten Forms Using AWS Textract, Lambda and S3
The 2021 winter storms in Texas led our client to a new use case: collecting data from users through handwritten paper forms and saving the data points to our existing database tables.
As developers, this was a new challenge and an opportunity to learn and try new technology stacks.
After a good round of research, we decided to use “Textract”, the OCR service provided by AWS.
While working on it, we could not find many resources to guide us through these services, so on behalf of our team I decided to document and publish our journey.
Amazon Textract is a machine learning service that automatically extracts text, handwriting, and data from scanned documents. It goes beyond simple optical character recognition (OCR) to identify, understand, and extract data from forms and tables.
In this post we will be covering the following checkpoints:
1. How to upload an image/document to an S3 bucket
2. How to create a Lambda function that gets triggered when a new image or doc is uploaded to S3
3. How to write Python code in the Lambda function that parses the analyzed text/data from the doc/image and provides a JSON output
4. How to create a new response JSON file whenever a new image/doc is uploaded to S3
Prerequisites
1. You need to have an AWS account and some basic knowledge of AWS services.
2. To use AWS Textract in Python, the latest “boto3” package is needed, which is not currently available in the AWS Lambda hosted environment. Install it into a local folder with this command: pip install --target ./python boto3
3. You should have created credentials (an Access Key and a Secret Key) for your account.
4. The following AWS services are used throughout this guide:
- Lambda Service
- Textract Service
- Simple Storage Service i.e. S3
- Identity and Access Management Service (IAM)
1. Uploading Image/doc to S3 bucket
Create S3 Bucket
1. Go to the AWS S3 page and click Create Bucket
2. Enter the bucket name and region, then click Next
3. Set permissions and click Create Bucket
Note: The region selected should be the same as the one used to create the Lambda function
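For readers who prefer scripting over the console, the same bucket can be created with boto3. This is a minimal sketch, not the steps we followed: the function name `create_forms_bucket` and the example bucket name are hypothetical, and `s3_client` is assumed to be a boto3 S3 client.

```python
def create_forms_bucket(s3_client, bucket_name, region):
    """Create an S3 bucket in the given region and return the request parameters.

    `s3_client` is assumed to be a boto3 S3 client, e.g. boto3.client("s3").
    """
    params = {"Bucket": bucket_name}
    # us-east-1 is the default location and must not be sent as a
    # LocationConstraint; every other region has to be named explicitly.
    if region != "us-east-1":
        params["CreateBucketConfiguration"] = {"LocationConstraint": region}
    s3_client.create_bucket(**params)
    return params


# Example usage (hypothetical bucket name, needs AWS credentials):
#   import boto3
#   create_forms_bucket(boto3.client("s3", region_name="us-east-2"),
#                       "my-handwritten-forms", "us-east-2")
```

Passing the region explicitly makes it easier to keep the bucket and the Lambda function in the same region, as the note above requires.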
Upload image to S3 bucket
For this example, we upload images directly through the AWS S3 bucket console.
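If you would rather upload from a script than from the console, a sketch along these lines should work. The helper name `upload_form_image` and the file and bucket names in the usage comment are hypothetical; `upload_file` is the standard boto3 S3 client method.

```python
from pathlib import Path


def upload_form_image(s3_client, local_path, bucket, key=None):
    """Upload a local image to S3 and return the object key that was used.

    `s3_client` is assumed to be a boto3 S3 client, e.g. boto3.client("s3").
    """
    path = Path(local_path)
    # Default the object key to the file name when no key is given.
    object_key = key or path.name
    # boto3 signature: upload_file(Filename, Bucket, Key)
    s3_client.upload_file(str(path), bucket, object_key)
    return object_key


# Example usage (hypothetical names, needs AWS credentials):
#   import boto3
#   upload_form_image(boto3.client("s3"), "scan.jpg", "my-handwritten-forms")
```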
2. Extracting data from an S3 image
This process consists of a Lambda function that gets triggered whenever an image (.jpg extension) gets uploaded to the S3 bucket.
Follow the steps below
Creating the S3 Lambda Trigger
Steps to create a Lambda function that gets triggered whenever an image is uploaded
1. Go to the AWS Lambda page and click Create function
2. Select “Use a Blueprint”, search for the “s3-get-object-python” template and click “configure”
3. Enter the function name and role name
4. To create the S3 trigger, under the S3 trigger tab:
- Add the bucket name
- Select “All object create events” as the Event type
- Set the Suffix to .jpg
5. Click Create function
Note: We can add more than one trigger (for example for PNG or PDF files) based on our needs
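Once the trigger is in place, Lambda receives an S3 notification event for every matching upload. As a minimal sketch of how the bucket name and object key can be read from that event (the helper name `s3_objects_from_event` is hypothetical; the `Records` / `s3` / `bucket` / `object` fields follow the S3 event message structure):

```python
from urllib.parse import unquote_plus


def s3_objects_from_event(event):
    """Return a (bucket, key) pair for every record in an S3 notification event."""
    objects = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded (a space becomes "+"), so decode them
        # before passing the key on to S3 or Textract.
        key = unquote_plus(record["s3"]["object"]["key"])
        objects.append((bucket, key))
    return objects
```

Decoding the key matters for scanned forms in particular, since file names produced by scanners and phones often contain spaces or parentheses.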
Configuring the roles created
1. Once the Lambda function is created, go to the Configuration tab on the Lambda function page
2. Choose the Permissions tab, where we can see the Role that we have created
3. Click on the Role name; this will open the role’s summary page
4. Click Attach policies and add “AmazonTextractFullAccess”; this gives the Lambda function access to the AWS Textract service
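The same policy attachment can be scripted. This is a sketch under assumptions: `attach_textract_policy` is a hypothetical helper, `iam_client` is assumed to be a boto3 IAM client, and the role name in the usage comment is a placeholder for whatever you entered when creating the function.

```python
# ARN of the AWS managed policy named in the step above.
TEXTRACT_POLICY_ARN = "arn:aws:iam::aws:policy/AmazonTextractFullAccess"


def attach_textract_policy(iam_client, role_name):
    """Attach the managed Textract policy to the Lambda execution role."""
    iam_client.attach_role_policy(RoleName=role_name, PolicyArn=TEXTRACT_POLICY_ARN)
    return TEXTRACT_POLICY_ARN


# Example usage (hypothetical role name, needs AWS credentials):
#   import boto3
#   attach_textract_policy(boto3.client("iam"), "my-textract-lambda-role")
```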
Add Layers to Lambda function
Step 1 - Creating the boto3 zip
1. On the local machine, create a directory for the project
mkdir -p zip_boto3/python
2. Install the boto3 package into the zip_boto3/python directory
pip install --target zip_boto3/python boto3
3. Zip the contents of zip_boto3
cd zip_boto3
zip -r boto3-layer.zip python
Step 2 - Creating the layer in AWS
1. Log in to the AWS console and go to Lambda
2. Select Layers and click Create layer
3. Fill in the details for the layer, upload boto3-layer.zip, and select Python 3.7 as the compatible runtime
Step 3 - Using the layer
1. Select the Lambda function that we have created
2. Click on Layers and then Add a layer
3. Pick the layer that we have created and save the Lambda function
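Steps 2 and 3 can also be done from a script. The sketch below is an illustration rather than what we ran: `publish_and_attach_layer` is a hypothetical helper, `lambda_client` is assumed to be a boto3 Lambda client, and the layer and function names in the usage comment are placeholders.

```python
def publish_and_attach_layer(lambda_client, zip_path, layer_name, function_name,
                             runtime="python3.7"):
    """Publish a zip file as a Lambda layer and attach it to a function.

    Returns the ARN of the new layer version.
    """
    with open(zip_path, "rb") as handle:
        zip_bytes = handle.read()

    published = lambda_client.publish_layer_version(
        LayerName=layer_name,
        Content={"ZipFile": zip_bytes},
        CompatibleRuntimes=[runtime],  # same runtime as chosen in Step 2
    )
    layer_arn = published["LayerVersionArn"]

    # Note: Layers replaces the function's whole layer list, so include any
    # layers that are already attached if you want to keep them.
    lambda_client.update_function_configuration(
        FunctionName=function_name,
        Layers=[layer_arn],
    )
    return layer_arn


# Example usage (hypothetical names, needs AWS credentials):
#   import boto3
#   publish_and_attach_layer(boto3.client("lambda"), "zip_boto3/boto3-layer.zip",
#                            "boto3-layer", "my-textract-function")
```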
3. Add code to Lambda Function
Create two files in your function folder:
lambda_function.py
Note: lambda_function.py captures the table data from the uploaded file and writes a JSON file with the table contents keyed by column name
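To make the flow concrete, here is a minimal sketch of such a handler. It is not our original listing: it walks the raw Textract `Blocks` directly instead of using trp.py, it assumes the first row of each table holds the column names, and the helper names (`cell_text`, `tables_from_blocks`, `process_image`) are hypothetical. `analyze_document` and `put_object` are the standard boto3 calls.

```python
import json
from urllib.parse import unquote_plus


def cell_text(cell, blocks_by_id):
    """Join the WORD blocks that a CELL block references as children."""
    words = []
    for rel in cell.get("Relationships", []):
        if rel["Type"] != "CHILD":
            continue
        for child_id in rel["Ids"]:
            child = blocks_by_id[child_id]
            if child["BlockType"] == "WORD":
                words.append(child["Text"])
    return " ".join(words)


def tables_from_blocks(blocks):
    """Convert Textract blocks into tables: one list of row dicts per table."""
    blocks_by_id = {block["Id"]: block for block in blocks}
    tables = []
    for table in (b for b in blocks if b["BlockType"] == "TABLE"):
        grid = {}  # row index -> {column index -> cell text}
        for rel in table.get("Relationships", []):
            if rel["Type"] != "CHILD":
                continue
            for cell_id in rel["Ids"]:
                cell = blocks_by_id[cell_id]
                if cell["BlockType"] == "CELL":
                    row = grid.setdefault(cell["RowIndex"], {})
                    row[cell["ColumnIndex"]] = cell_text(cell, blocks_by_id)
        if not grid:
            continue
        rows = [grid[index] for index in sorted(grid)]
        header = rows[0]  # assumption: the first row contains the column names
        tables.append(
            [{header[col]: row.get(col, "") for col in sorted(header)} for row in rows[1:]]
        )
    return tables


def process_image(bucket, key, textract, s3):
    """Analyze one S3 image with Textract and write the tables back as JSON."""
    response = textract.analyze_document(
        Document={"S3Object": {"Bucket": bucket, "Name": key}},
        FeatureTypes=["TABLES"],
    )
    tables = tables_from_blocks(response["Blocks"])
    output_key = key.rsplit(".", 1)[0] + ".json"
    s3.put_object(
        Bucket=bucket,
        Key=output_key,
        Body=json.dumps(tables).encode("utf-8"),
        ContentType="application/json",
    )
    return output_key


def lambda_handler(event, context):
    # boto3 is imported here so the helpers above can be exercised without AWS.
    import boto3

    textract = boto3.client("textract")
    s3 = boto3.client("s3")
    written = []
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])
        written.append(process_image(bucket, key, textract, s3))
    return {"written": written}
```

This sketch ignores merged cells and duplicate column names, both of which a real form may contain, so treat it as a starting point only.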
trp.py
Save and test your Lambda function
4. Result
Whenever an object with the .jpg extension is uploaded to the S3 bucket, the associated Lambda function is triggered and an output file with the .json extension is written back to the S3 bucket.
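Because the output lands in the same bucket that fires the trigger, it is worth checking that the output key cannot start the function again. The sketch below shows one way to derive the output key and run that check. The helper names are hypothetical, and it assumes the trigger's suffix filter is a literal, case-sensitive match on the object key.

```python
from pathlib import PurePosixPath

TRIGGER_SUFFIX = ".jpg"  # the suffix configured on the S3 trigger


def output_key_for(image_key):
    """Swap the image extension for .json, keeping any key prefix."""
    return str(PurePosixPath(image_key).with_suffix(".json"))


def would_retrigger(key, suffix=TRIGGER_SUFFIX):
    """True if an object with this key would match the trigger's suffix filter."""
    # Assumption: plain, case-sensitive string comparison on the key.
    return key.endswith(suffix)
```

Under that assumption, a file named `scan.JPG` would not match a `.jpg` suffix, so a separate trigger is needed if uploads may use upper-case extensions.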
Hope the content was informative! :)
-Noufal Rijal