Background

Our client is a lead generation startup that aims to shorten the time it takes to connect general contractors with relevant construction projects. To provide this service in real time, up-to-date information about construction projects, from early development to completion, is collected and fed into an in-house matching algorithm. As a result, leads are generated with minimal manual effort spent searching for relevant projects.

Problem Statement

Our client collects documents from several sources, often as PDFs. Formatting in these documents may differ from year to year, so a custom text extraction method is necessary to parse the text and isolate the most relevant information. Beam Data was therefore tasked with identifying an OCR tool that could accurately isolate only the relevant text and could be integrated into our client’s cloud infrastructure.

Methodology

To tackle these tasks, it was important to understand our client’s current data architecture and pipelines so that the solution could be integrated seamlessly. Because AWS was our client’s platform of choice, we explored a native AWS OCR tool, Textract, which offers a query feature for retrieving specific answers from a document, alongside AWS Lambda and Step Functions for automation.
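To illustrate the query feature, here is a minimal Python sketch of how a Textract query request might be built and how answers could be read back from the response. It is not the client's actual implementation: the helper names `build_query_request` and `extract_answers`, the aliases, and the example questions are assumptions made for illustration. The request and response shapes follow Textract's AnalyzeDocument operation with the QUERIES feature type.

```python
from typing import Any, Dict, Optional


def build_query_request(bucket: str, key: str, queries: Dict[str, str]) -> Dict[str, Any]:
    """Build keyword arguments for a Textract AnalyzeDocument call.

    `queries` maps a short alias (hypothetical, e.g. "OWNER") to the
    natural-language question Textract should answer.
    """
    return {
        "Document": {"S3Object": {"Bucket": bucket, "Name": key}},
        "FeatureTypes": ["QUERIES"],
        "QueriesConfig": {
            "Queries": [
                {"Text": text, "Alias": alias} for alias, text in queries.items()
            ]
        },
    }


def extract_answers(response: Dict[str, Any]) -> Dict[str, Optional[str]]:
    """Map each query alias to its first answer text, or None if unanswered."""
    blocks = response.get("Blocks", [])
    by_id = {block["Id"]: block for block in blocks}
    answers: Dict[str, Optional[str]] = {}
    for block in blocks:
        if block.get("BlockType") != "QUERY":
            continue
        # Fall back to the question text when no alias was supplied.
        alias = block["Query"].get("Alias") or block["Query"]["Text"]
        answers[alias] = None
        for rel in block.get("Relationships", []):
            if rel["Type"] != "ANSWER":
                continue
            for result_id in rel["Ids"]:
                result = by_id.get(result_id)
                if result and result.get("BlockType") == "QUERY_RESULT":
                    answers[alias] = result.get("Text")
                    break
    return answers
```

With boto3 installed and AWS credentials configured, the request could be sent with `boto3.client("textract").analyze_document(**build_query_request(...))` and the result passed to `extract_answers`. Returning `None` for unanswered queries makes it easy for a later step to detect and handle failed queries.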

Architecture

The first schematic outlines the key steps involved in extracting data from a document using AWS Textract and the available storage options. The Textract step can then be wrapped in an automation pipeline built on AWS Lambda and Step Functions, which executes the pipeline and handles errors.
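The automation described above could look roughly like the following Python sketch. It is a sketch under stated assumptions, not the pipeline built for the client: the event keys (`bucket`, `key`, `queries`), the `MissingAnswerError` exception, the state names, and the placeholder Lambda ARN are all hypothetical. The state machine is written as a Python dict in Amazon States Language form, using the standard `Retry` and `Catch` fields of a Task state.

```python
import json
from typing import Any, Callable, Dict, Optional


class MissingAnswerError(Exception):
    """Hypothetical error raised when a query returns no answer."""


def run_extraction(event: Dict[str, Any], analyze: Callable[..., Dict[str, Any]]) -> Dict[str, Any]:
    """Run Textract queries for one document and fail loudly on missing answers.

    `analyze` is injected so the logic can be exercised without AWS access;
    in Lambda it would be the Textract client's `analyze_document` method.
    """
    response = analyze(
        Document={"S3Object": {"Bucket": event["bucket"], "Name": event["key"]}},
        FeatureTypes=["QUERIES"],
        QueriesConfig={"Queries": [{"Text": q} for q in event["queries"]]},
    )
    blocks = response.get("Blocks", [])
    by_id = {b["Id"]: b for b in blocks}
    answers: Dict[str, Optional[str]] = {q: None for q in event["queries"]}
    for block in blocks:
        if block.get("BlockType") != "QUERY":
            continue
        ids = [
            i
            for rel in block.get("Relationships", [])
            if rel["Type"] == "ANSWER"
            for i in rel["Ids"]
        ]
        texts = [by_id[i]["Text"] for i in ids if i in by_id]
        answers[block["Query"]["Text"]] = texts[0] if texts else None
    missing = [q for q, a in answers.items() if a is None]
    if missing:
        raise MissingAnswerError(f"No answer for: {missing}")
    return {"key": event["key"], "answers": answers}


def lambda_handler(event: Dict[str, Any], context: Any) -> Dict[str, Any]:
    """Lambda entry point; boto3 is imported lazily so the module loads without it."""
    import boto3

    return run_extraction(event, boto3.client("textract").analyze_document)


# Step Functions definition (Amazon States Language) with retry and catch rules.
STATE_MACHINE = {
    "StartAt": "ExtractText",
    "States": {
        "ExtractText": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:function:extract-text",  # placeholder ARN
            # Retry transient Lambda service errors with exponential backoff.
            "Retry": [{
                "ErrorEquals": ["Lambda.ServiceException", "Lambda.TooManyRequestsException"],
                "IntervalSeconds": 5,
                "MaxAttempts": 2,
                "BackoffRate": 2.0,
            }],
            # Route unanswered queries to review; everything else fails the run.
            "Catch": [
                {"ErrorEquals": ["MissingAnswerError"], "ResultPath": "$.error", "Next": "FlagForReview"},
                {"ErrorEquals": ["States.ALL"], "ResultPath": "$.error", "Next": "Failed"},
            ],
            "Next": "Done",
        },
        "FlagForReview": {"Type": "Pass", "End": True},
        "Failed": {"Type": "Fail", "Error": "ExtractionFailed"},
        "Done": {"Type": "Succeed"},
    },
}

STATE_MACHINE_JSON = json.dumps(STATE_MACHINE, indent=2)
```

Separating `run_extraction` from `lambda_handler` keeps the extraction logic testable locally, while the `Catch` rule keyed on the exception name lets Step Functions send documents with unanswered queries down a separate path instead of failing the whole run.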

Conclusion

In this engagement, the Beam Data team used its AWS expertise to create an automated text retrieval pipeline. This was done by customizing the query functionality of AWS Textract to better suit the documents of interest. Moreover, query failures were handled in the Lambda and Step Functions pipeline.