Real-time Data Streaming Pipeline Optimization

Background

Our client provides advanced agricultural tools and digital information that help farmers become more profitable. The company uses sensor solutions to deliver real-time, actionable insights and gives farmers the means to control their operating costs. Its product saves farms over $20,000 annually by improving energy efficiency and reducing machine maintenance through predictive analytics.

The Beam Data team's work for the client covered two main areas:

  1. Comprehensive data streaming pipeline optimization
  2. Real-time data visualization using QuickSight

The proposed pipeline is more efficient and functional at collecting large volumes of data, visualizing it, and sending timely notifications to end users.

Problem Statement

The client uses AWS as its main cloud provider. Kinesis Firehose and AWS Lambda transform and store the data that the devices collect. The data is served to the client’s app via RDS and DynamoDB. The app provides time-series analytics, including energy consumption and its associated cost.

However, as the volume of real-time data and the demand for timely analysis grew, the client wanted to update the pipeline infrastructure to make it more robust, reliable, and scalable. The existing pipeline broke at unpredictable times, took a long time to process data for frontend users, and was constrained by a DynamoDB rate limit. The Beam Data team proposed a few changes to improve the pipeline's reliability and scalability.

Tools used: AWS (IoT Core, Kinesis Data Firehose, Kinesis Data Analytics, S3, Lambda, DynamoDB, API Gateway, SNS, Athena, QuickSight)

Challenges

The existing pipeline is quite sophisticated, and it took some time to understand how the data is transformed and consumed by end users. The infrastructure is loosely coupled, so a detailed review and a complete understanding of the entire data flow were needed.

Original way of data collection and storage

During the review of the pipeline, a few flaws were discovered and patched immediately. The Beam Data team also proposed a few pipeline design changes to improve reliability and reduce infrastructure cost.

Key results

We discovered that a few Glue crawlers were running every hour on buckets related to some devices. These crawlers contributed significantly to the increase in infrastructure cost. We recommended pausing the crawlers and enabling the Glue metadata registry at the Kinesis level. This approach cuts the time spent on idle tasks and makes the pipeline less reliant on Glue crawlers.

We proposed a few design changes. One of the most suitable options is to use Athena and QuickSight for data analytics and visualization (see the appendix for the dashboard that the Beam Data team created).

Proposed pipeline to address issues with DynamoDB and provide visualization to end-users.

The team also added a step that uses Kinesis Analytics to pre-aggregate the data per minute instead of saving every per-second data point from each device. This should make computing some statistics less intensive and reduce cost. We also recommended differentiating devices by type, which streamlines the process of deploying new devices.
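To make the per-minute pre-aggregation concrete, here is a minimal sketch of the windowing logic in plain Python. It is not the Kinesis Analytics application itself, which is not shown in this write-up. The record fields (device_id, ts, voltage) and the chosen summary statistics are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean


def aggregate_per_minute(readings):
    """Collapse per-second readings into one summary row per device per minute.

    `readings` is an iterable of dicts with hypothetical keys:
    device_id (str), ts (int, Unix epoch seconds), voltage (float).
    """
    buckets = defaultdict(list)
    for r in readings:
        # Floor the timestamp to the start of its one-minute window.
        minute_start = r["ts"] - r["ts"] % 60
        buckets[(r["device_id"], minute_start)].append(r["voltage"])

    # Emit one row per (device, minute), ordered by device and then time.
    return [
        {
            "device_id": device_id,
            "minute_start": minute_start,
            "count": len(values),
            "avg_voltage": mean(values),
            "min_voltage": min(values),
            "max_voltage": max(values),
        }
        for (device_id, minute_start), values in sorted(buckets.items())
    ]
```

With one reading per second, each device produces one stored row per minute instead of sixty, which is where the reduction in storage and downstream computation would come from.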

In addition to aggregation, we deployed a pre-trained anomaly detection model provided by Amazon that is built on the Random Cut Forest algorithm. This extra functionality outputs an anomaly score for each appliance that the device is connected to. A Lambda function checks for abnormal scores and notifies the user by text message through SNS.

Prototype pipeline to aggregate data and detect anomalies
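The score check and notification step described above could look roughly like the following sketch. The threshold value, the row fields (device_id, appliance, anomaly_score), and the function names are hypothetical and are not taken from the client's Lambda code. The publish callable is injected so that the logic can be exercised without AWS credentials; sns_publisher shows one possible way to wire it to the boto3 SNS client.

```python
ANOMALY_THRESHOLD = 3.0  # assumed cutoff; would need tuning on real scores


def find_anomalies(rows, threshold=ANOMALY_THRESHOLD):
    """Return the rows whose anomaly score exceeds the threshold."""
    return [r for r in rows if r["anomaly_score"] > threshold]


def format_alert(row):
    """Build the text message for one anomalous row."""
    return (
        f"Unusual reading on {row['appliance']} "
        f"(device {row['device_id']}): "
        f"anomaly score {row['anomaly_score']:.2f}"
    )


def notify(rows, publish, threshold=ANOMALY_THRESHOLD):
    """Send one alert per anomalous row and return the messages sent."""
    alerts = [format_alert(r) for r in find_anomalies(rows, threshold)]
    for message in alerts:
        publish(message)
    return alerts


def sns_publisher(phone_number):
    """Return a publish callable that sends SMS through SNS (needs boto3)."""
    import boto3  # imported lazily so the logic above runs without AWS

    sns = boto3.client("sns")
    return lambda message: sns.publish(PhoneNumber=phone_number, Message=message)
```

Inside a Lambda handler, the scored rows would be decoded from the incoming event and passed to notify together with sns_publisher for the farmer's phone number. The exact event shape depends on how the function is attached to the stream, so it is left out here.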

Conclusion

The proposed changes to the data infrastructure were tested and shown to significantly reduce infrastructure cost and make the pipeline more resilient. Through this engagement, Beam Data also gained consulting experience in the smart agriculture industry, which applies IoT solutions.

Appendix: Dashboard of the real-time voltage usage (demo)

The client has a subscription-based app for farmers, where they can log in and visualize key information about their energy expenditure. This information is collected by sensors provided by the client and stored in AWS S3 and AWS DynamoDB.

The client requested QuickSight dashboard templates that provide valuable KPIs and metrics to their customers about their energy expenditure. During our first contact, the client also mentioned that it would be nice to include predictions and forecasts among these metrics.
