Handling Vast Amounts Of Data With Data Analytics

How data analytics handle massive amounts of data

Valentyn Zubenko

Data Engineering Director

Dmytro Voloshyniuk

Senior Data Engineer

February 21, 2023


Data and AI

With the right handling, data analytics can be a game changer in business intelligence.

Allied Market Research reports that the data processing market is booming: it is expected to reach $5 billion by 2031, up from its current worth of $650 million. That is why there is so much information available about handling data with different processing frameworks. To get a clearer picture, let’s look at some ideas and confirmed cases of how data processing works in practice.

In a nutshell, many customers want to build or structure their data ecosystem. The design usually follows how business users consume the data, whether through direct access or through reports, and, to be honest, there is no wrong choice between the two. However, at the beginning of a project, the requirements may cover only some of the use cases and the essential needs of the architecture, and these are crucial to building a data ecosystem that meets the customer’s expectations.

From our experience, a cloud approach is often more effective for building data analytics platforms, and it can include Artificial Intelligence (AI) and Machine Learning (ML) components. Today we’ll describe how the Azure Data Platform can align with a customer’s expectations.

Problem

When using the selected platform, we faced two problems:

First, we had to store data from multiple sources in one place with the ability to expose this data to other consumers.
Second, we had to map GitHub data with JIRA data and create visual metrics in order to see the performance of the developers.

We did not have many requirements for the system, but some were discovered:

The system should live on the Azure Cloud
The system’s cost should be as low as possible
The visualization tool is Power BI (Business Intelligence)

Considering the above, our case revolved around dealing with data storage and the JIRA-GitHub connection. These two aspects proved vital for tapping into the selected data processing capabilities. Yet, to get a clearer picture, it is also crucial to give some context.

Background

The essential first step in our journey was organizing the data storage. Our previous experience indicated that we could use a column-store RDBMS to hold data in staging and then transform and structure it. However, many newer data types exist, such as JSON and blob files, and a classic RDBMS does not meet the architecture’s requirements for them. As a result, we found that storing this kind of data in an RDBMS is not optimal.

We do want to point out that PostgreSQL offers rich functionality for working with JSON, and this DB engine would be the right choice if no other data types were involved. In our case, however, we also had to store blob files (images, video from Twitter, etc.). One critical requirement was using Azure, which has blob storage where we could store our data and to which many other services can connect. This let us query the data using SQL (Structured Query Language) through dedicated services. At this point, we needed the right data engineer who could use the Synapse and Databricks ecosystems to handle the problems noted above.

Synapse ecosystem

Azure has its own cloud data ecosystem, Synapse Analytics, which includes the following:

Data pipeline

Data lake

Dedicated SQL pool

Other services are available in Azure as well, for instance Analysis Services, AI/ML, and Power BI (Business Intelligence). You can also add Azure Functions, Logic Apps, etc., to meet your requirements. In our case, we could easily reuse the experience of implementing an analytics system on Synapse from a previous project. We connected more than 10 data sources with the help of Data Factory and loaded them into the data lake. Then, we queried the data with SQL from tables in a dedicated or serverless SQL pool: the dedicated SQL pool has static tables in the raw layer, while the serverless SQL pool has external tables. This setup makes it possible to query data through Management Studio or the OLE DB data provider (this was one of the requirements).

With Azure Synapse Analytics, we kept the cost of the necessary services low. Additionally, we have the power of a Dedicated SQL pool with all the functionality of Azure DWH. You can also pause this service when the power is not needed and start it back up later. We won’t describe all the benefits of these services; if you need something clarified, the Microsoft documentation has all the necessary information.

Returning to the case, here is one of the reference architectures based on the Azure Data ecosystem (see Fig. 1).

Figure 1. Architecture based on the Azure Data ecosystem

The figure above shows the entire process, from extraction from the data sources through orchestration, analysis, and visualization. The second part of the solution to the problems above relied on the Databricks ecosystem.

Databricks ecosystem

The Databricks ecosystem provides a unified set of tools for data engineers and data scientists. It helps build data and ML pipelines. The tool allows developers to execute various Spark commands in a convenient notebook-like format. The Databricks platform architecture is composed of two primary parts:

The infrastructure used by Databricks
Customer-owned infrastructure

That is why Databricks workloads are cloud-agnostic. Yet, at the same time, the service allows deeper integration with other cloud services. For example, in the current implementation, we coupled Databricks notebooks with other Azure services, such as Data Factory, ADLS, Azure Monitor, Event Hubs and Active Directory.

In addition, the Azure service offers a simple UI so you can choose the best cluster type for different workloads. Along with data pipelines, our team implemented several ML notebooks. The data science team used the pandas API on Spark, which automatically translates pandas commands into their Spark equivalents.

Depending on the workload type, the team may pick memory-optimized clusters for memory-intensive transformations or ML runtime clusters with additional ML libraries installed. GPU-enabled clusters and short-term job clusters are also available, the latter being optimal for cost savings. Databricks manages and deploys the cloud infrastructure on the customer’s behalf.

Building a data lake

For building a modern data lake, our team embraced the lakehouse approach. The data was split into three layers based on format and applied transformations.

The silver and gold layers used the open-source Delta Lake library, which provided an effective structure for ad-hoc querying and reporting. Built on top of another open-source format, Parquet, Delta Lake adds advanced features and capabilities that enable additional robustness, speed, versioning, and data-warehouse-like ACID compliance (atomicity, consistency, isolation, and durability).

As a result, our team built a low-cost solution where storage relied entirely on the Azure Data Lake Storage Gen2 service. Additionally, we could limit compute costs by stopping dev clusters and using job clusters for scheduled ETL (extract, transform, and load) workloads.
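To make the cost argument concrete, here is a rough back-of-the-envelope sketch. The hourly rate and running hours are hypothetical placeholders, not actual Azure or Databricks prices, and the same rate is assumed for both cluster types for simplicity. The point is only that a cluster billed while a scheduled job runs costs far less than one left running around the clock.

```python
def monthly_compute_cost(hourly_rate: float, hours_per_day: float, days: int = 30) -> float:
    """Return the compute cost for a cluster that runs `hours_per_day` each day."""
    if not 0 <= hours_per_day <= 24:
        raise ValueError("hours_per_day must be between 0 and 24")
    return hourly_rate * hours_per_day * days


# Hypothetical numbers for illustration only (not real prices).
HOURLY_RATE = 2.0      # assumed cost of one cluster-hour
ALWAYS_ON_HOURS = 24   # a dev cluster that is never stopped
JOB_HOURS = 1.5        # a job cluster that exists only while the nightly ETL runs

always_on = monthly_compute_cost(HOURLY_RATE, ALWAYS_ON_HOURS)
job_only = monthly_compute_cost(HOURLY_RATE, JOB_HOURS)
savings_ratio = 1 - job_only / always_on  # share of the always-on bill avoided
```

With these assumed inputs, the job cluster accounts for a small fraction of the always-on bill; real savings depend on actual rates and run times.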


Connecting JIRA and GitHub information

Our initial goal was to connect two data sources, JIRA and GitHub, and organize them in a data lake. The transformed data would later be used to build reports showing the relationship between developers’ work in GitHub and closed tickets in JIRA, so that a project manager could see a dashboard of the team’s performance. After finishing this scope of work, we extended our data sources a couple of times:

First, we connected GitLab as an additional source to GitHub because some projects store code there, which gave us the ability to connect more projects to the dashboard.

Second, we connected social sources, Twitter and Telegram. These were needed to gather the information for building an interactive map that shows news at the geographical location each message refers to. To do that, we applied NLP (Natural Language Processing) to recognize a city name when it is mentioned and map it to a geographical location. We also used Databricks Auto Loader and streaming to shorten the time before the data becomes available for analytics.
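The article does not say which NLP model or geocoding service was used for this step. As a deliberately simplified stand-in, the sketch below replaces the NLP model with a dictionary (gazetteer) lookup. The `GAZETTEER` table, the function name `locate_message`, and the rounded coordinates are all illustrative assumptions.

```python
import re
from typing import Optional

# Hypothetical gazetteer: lowercase city name -> (latitude, longitude), rounded.
GAZETTEER = {
    "kyiv": (50.45, 30.52),
    "lviv": (49.84, 24.03),
    "odesa": (46.48, 30.72),
}


def locate_message(text: str) -> Optional[dict]:
    """Return the first known city mentioned in `text` with its coordinates.

    A real pipeline would use a named-entity recognition model and a
    geocoding service; this lookup only illustrates the data flow.
    """
    # Split the message into alphabetic tokens (Unicode letters only).
    for token in re.findall(r"[^\W\d_]+", text.lower()):
        if token in GAZETTEER:
            lat, lon = GAZETTEER[token]
            return {"city": token.title(), "lat": lat, "lon": lon}
    return None  # no known city mentioned
```

The returned latitude and longitude are the kind of geocoordinate attributes that can then be attached to a message record before it is plotted on a map.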

The result

We built a four-layer data lake to store raw and prepared data: Raw, Cleansed, Curated, and Laboratory. Laboratory is an extra layer that combines parts from the first three layers. It is used for self-analytics and Machine Learning. There are no specific rules for this layer, as the data is stored in different formats according to the business needs.

During development, we used different approaches to store and process data in the data lake. These aspects were involved:

We avoided using a metadata file to determine which files had been processed. Instead, we agreed to use two folders: we processed a file and then moved it to the archive folder, where the raw layer stores data in its original format. Some pipelines rely on the built-in Auto Loader functionality for tracking increments.
We aligned the data format on the second layer and used the Delta Lake format for all sources. This let us standardize all types of data and use a great feature of the format: tracking the history of changes.
We stored combined data from our sources in one data model. We used the curated layer for consumption by the BI tool or application.
We had one more temporary layer, laboratory, for different purposes, like running experiments and giving access to the data science team.
The data science team created various reproducible ML experiments, and the MLflow tool handled essential MLOps processes.
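The two-folder approach from the first item above can be sketched as follows. This is a minimal local-filesystem illustration, not the actual pipeline code: the function name, the folder arguments, and the caller-supplied `handler` are assumptions, and a data lake implementation would use the storage service’s own move operation.

```python
import shutil
from pathlib import Path
from typing import Callable, List


def process_incoming(incoming: Path, archive: Path,
                     handler: Callable[[Path], None]) -> List[str]:
    """Process every file in `incoming`, then move it to `archive`.

    A file that is still in `incoming` has not been processed yet, so no
    separate metadata file is needed to track state.
    """
    archive.mkdir(parents=True, exist_ok=True)
    processed = []
    for path in sorted(p for p in incoming.iterdir() if p.is_file()):
        handler(path)  # transform/load step supplied by the caller
        # Move only after the handler succeeded; a failed file stays put
        # and is picked up again on the next run.
        shutil.move(str(path), str(archive / path.name))
        processed.append(path.name)
    return processed
```

Because the move happens only after the handler returns, rerunning the function after a failure retries exactly the files that were not archived.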

For transformation logic, we used Python notebooks and ran them on the Databricks cluster. To track the execution of all the notebooks, we connected Azure Monitor, where we stored all the custom notifications. We had a dedicated notebook that saved logs, which we reused in other places. For orchestration, we used Data Factory. Finally, to check that our processes were healthy and had no failed executions, we connected Power BI directly to Azure Monitor and created a dashboard showing the system’s health. MLflow capabilities were used to monitor the performance metrics of the ML processes and to version models.
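As an illustration of the reusable logging idea, the sketch below builds structured log records and derives the kind of health summary a dashboard could display. It is an assumption-laden sketch: the record fields, status values, and function names are hypothetical, and sending the records to Azure Monitor is intentionally left out.

```python
import json
from collections import Counter
from datetime import datetime, timezone


def make_log_record(notebook: str, status: str, message: str = "") -> str:
    """Build one JSON log line for a notebook run (shipping it is out of scope)."""
    if status not in {"succeeded", "failed"}:
        raise ValueError(f"unknown status: {status}")
    return json.dumps({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "notebook": notebook,
        "status": status,
        "message": message,
    })


def health_summary(log_lines):
    """Count runs per status and list the notebooks that had failures."""
    records = [json.loads(line) for line in log_lines]
    counts = Counter(r["status"] for r in records)
    failed = sorted({r["notebook"] for r in records if r["status"] == "failed"})
    return {
        "succeeded": counts["succeeded"],
        "failed": counts["failed"],
        "failed_notebooks": failed,
    }
```

Keeping the record format in one helper is what makes it reusable: every notebook emits the same fields, so a single query or report can cover all pipelines.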

What were the results of our work? We needed all these predefined rules and monitoring to achieve our goal. Lastly, we connected Power BI to visualize the connection between JIRA and GitHub (see Fig. 2).

Figure 2. JIRA and GitHub connection
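The article does not describe how commits were matched to tickets. One common convention is to mention the JIRA issue key (for example `DATA-12`) in the commit message, and the sketch below assumes exactly that. The record shapes (`sha`, `message`, `key`, `status`) and the function name are hypothetical.

```python
import re
from collections import defaultdict

# JIRA issue keys look like "PROJ-123": an uppercase project key, a dash, a number.
ISSUE_KEY = re.compile(r"\b[A-Z][A-Z0-9]+-\d+\b")


def commits_per_issue(commits, issues):
    """Map each known JIRA issue key to the commit SHAs that mention it.

    `commits`: iterable of dicts with "sha" and "message" (assumed shape).
    `issues`:  iterable of dicts with "key" and "status" (assumed shape).
    """
    known = {issue["key"] for issue in issues}
    linked = defaultdict(list)
    for commit in commits:
        # Deduplicate keys so a commit is linked to an issue at most once.
        for key in set(ISSUE_KEY.findall(commit["message"])):
            if key in known:  # ignore key-like strings that are not real issues
                linked[key].append(commit["sha"])
    return dict(linked)
```

The resulting mapping is the join that a report needs in order to relate development activity to ticket status; commits without a recognizable key simply stay unlinked.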

In the second iteration, we connected the social networks. Then, we grabbed the raw information and stored it in the raw layer of our architecture. Later, we transformed and kept it in the curated layer of the data model. Between these processes, we applied NLP in order to recognize the location mentioned in the message and extend our data model with geocoordinate attributes. Finally, we used the map to show news and historical mentions in a particular region in the identified country. Here you can see the results.

The bottom line

At first glance, the case above seems like an easy task, and with an appropriate architecture, achieving the results described above is possible. We can also extend our solution to meet new business needs, as each new case will have specific architecture requirements, use cases, and decisions. You can’t really build an ideal one-time solution, because each solution evolves and becomes stronger after several iterations.
