Executive Summary
WeCloudData is one of the fastest-growing Data & AI training companies in the world. Since 2016, WeCloudData has trained thousands of students and clients, helping them level up their data skills and mature their data organizations. As organizations all over the world continue to undergo digital transformation, enterprises are experiencing the pains that come with fully digitalizing a business. How do users find relevant content quickly and seamlessly within their workflow? How can content search be made simple and intuitive? WeCloudData is helping clients reinvent content search in their business by combining modern data engineering pipelines with sophisticated machine learning models deployed in the cloud, improving knowledge search capabilities while maximizing ROI.
Situation
As enterprises continue to digitize and consume data by the petabyte and exabyte, business units and technical staff encounter friction and barriers when searching for content and knowledge across the organization. There’s information overload and an overwhelming volume of knowledge content scattered throughout business systems and across the internet. Data anywhere, everywhere, all the time. These business users and tech staff have the following critical requirements when it comes to knowledge and content search:
- The search platform must be able to extract data and information from multiple sources and data types – internal and external. This speaks to the search breadth capability.
- The search platform must be able to drill deep into content and pull information that is relevant and accurate based on the query keywords and parameters. This speaks to the search depth capability.
- The search tool must understand the user’s context and adapt so that the top results account for the user’s department, job function and previous search history, and anticipate the user’s search needs and intent. This refers to the tool’s AI capabilities.
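To make the third requirement concrete, here is a minimal Python sketch of context-aware re-ranking. It is an illustration only, not the client's implementation: the `Document` and `UserContext` types, the `department` tag, and the boost weights are all hypothetical, and a simple keyword-overlap score stands in for a real search engine's relevance score.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Document:
    doc_id: str
    text: str
    department: str  # hypothetical metadata tag attached at ingestion


@dataclass
class UserContext:
    department: str
    recent_queries: List[str] = field(default_factory=list)


def keyword_score(query: str, doc: Document) -> float:
    """Fraction of query terms that appear in the document text."""
    terms = set(query.lower().split())
    words = set(doc.text.lower().split())
    return len(terms & words) / len(terms) if terms else 0.0


def rerank(query: str, docs: List[Document], user: UserContext,
           dept_boost: float = 0.5, history_boost: float = 0.25) -> List[Document]:
    """Order documents by keyword relevance plus user-context boosts.

    The boost values are arbitrary placeholders; a production system
    would typically learn them from click or feedback data.
    """
    history_terms = {t for q in user.recent_queries for t in q.lower().split()}

    def score(doc: Document) -> float:
        base = keyword_score(query, doc)
        if base == 0.0:
            return 0.0  # never surface documents that do not match the query
        bonus = dept_boost if doc.department == user.department else 0.0
        if history_terms & set(doc.text.lower().split()):
            bonus += history_boost
        return base + bonus

    return sorted(docs, key=score, reverse=True)
```

With two documents that match a query equally well, the one tagged with the user's department, or overlapping with the user's recent queries, is returned first.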
The volume and variety of content that needs to be scanned, manipulated and processed require a data architecture and platform that is robust, scalable and automated. As business units become more specialized, business functional knowledge and content also become siloed, detached and incongruent. Hence, the content search solution must be an integrated platform that pulls content from disparate sources into a unified data store, exposing the data for further processing and machine learning. This mechanism creates opportunities to reveal previously unseen connections between content and business functions.
Resolution
To build an integrated AI content search platform, Beam Data helped the client deploy a multi-stage data and machine learning pipeline:
- Content is ingested from multiple sources across the business (internal) as well as relevant external sources via APIs and webhooks into a central data lake
- The raw content is processed with Spark in Databricks
- The refined data is indexed and stored in Elasticsearch and Postgres databases
- Data from the databases is pulled into Databricks for Spark machine learning model training
- The machine learning models are deployed to the cloud and power the content search tool accessed by end-users
- The end-to-end pipeline is automated and orchestrated with Apache Airflow
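The stage ordering above can be sketched as a small dependency graph. The following is plain Python using the standard-library `graphlib` module, not Airflow code; it only illustrates the idea an orchestrator applies, namely running each stage after the stages it depends on. The stage names and the `run_pipeline` helper are hypothetical labels for the steps listed above.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List, Set

# Each stage maps to the set of stages that must finish before it starts.
# Stage names are illustrative labels for the steps described above.
PIPELINE: Dict[str, Set[str]] = {
    "ingest_to_data_lake": set(),
    "process_with_spark": {"ingest_to_data_lake"},
    "index_and_store": {"process_with_spark"},
    "train_models": {"index_and_store"},
    "deploy_models": {"train_models"},
}


def run_pipeline(stages: Dict[str, Set[str]],
                 tasks: Dict[str, Callable[[], None]]) -> List[str]:
    """Run every task in dependency order and return the order used."""
    order = list(TopologicalSorter(stages).static_order())
    for name in order:
        tasks[name]()  # in a real deployment, each call would launch a job
    return order


if __name__ == "__main__":
    # Placeholder tasks that only print the stage name.
    demo_tasks = {name: (lambda n=name: print("running", n)) for name in PIPELINE}
    run_pipeline(PIPELINE, demo_tasks)
```

In an orchestrator such as Airflow, the same dependencies would be declared between tasks in a DAG, with scheduling, retries and monitoring handled by the platform rather than by a loop like this one.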
Architecture
The search app is highly available and scalable because the entire data and machine learning pipeline is built in the cloud. Furthermore, this architecture is flexible and efficient due to its modularity and its automation with Airflow. Machine learning models can be added or replaced as needed, and microservices can be plugged into or out of the ecosystem as necessary.
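The modularity described here can be illustrated with a small registry that lets ranking models be added, replaced or removed without changing the calling code. This is a hedged sketch, not the deployed design: `ModelRegistry`, the `Ranker` signature and both example rankers are hypothetical stand-ins for real model services.

```python
from typing import Callable, Dict, List

# A ranker takes a query and candidate documents and returns them re-ordered.
Ranker = Callable[[str, List[str]], List[str]]


class ModelRegistry:
    """Keeps named rankers so they can be swapped at runtime."""

    def __init__(self) -> None:
        self._models: Dict[str, Ranker] = {}

    def register(self, name: str, model: Ranker) -> None:
        # Registering an existing name replaces the previous model.
        self._models[name] = model

    def unregister(self, name: str) -> None:
        self._models.pop(name, None)

    def names(self) -> List[str]:
        return sorted(self._models)

    def rank(self, name: str, query: str, docs: List[str]) -> List[str]:
        if name not in self._models:
            raise KeyError(f"no model registered under {name!r}")
        return self._models[name](query, docs)


def overlap_ranker(query: str, docs: List[str]) -> List[str]:
    """Rank by the number of query terms each document contains."""
    terms = set(query.lower().split())
    return sorted(docs, key=lambda d: len(terms & set(d.lower().split())),
                  reverse=True)


def shortest_first_ranker(query: str, docs: List[str]) -> List[str]:
    """Toy alternative that ignores the query and prefers short documents."""
    return sorted(docs, key=len)
```

Because callers only refer to a model by name, registering a new ranker under an existing name swaps the model without touching the search front end.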
Conclusion
Beam Data helped the client build a highly available and scalable integrated AI content search platform to help business users and tech staff find the relevant answers they need quickly. The seamless search experience integrates enterprise knowledge and content and helps users explore new connections between information. The team automated the data and machine learning pipeline with Apache Airflow and used the powerful Spark engine on Databricks to process the data and train machine learning models. The team will continue to improve this platform by adding MLflow and DevOps tools & techniques.