Data quality improvement is always a goal
The business value of data is becoming ever more important to commercial and public organizations as they continue to invest in capturing data better, analyzing it faster and more thoroughly, and making more accurate predictions and automated decisions.
Most of the attention goes to advanced machine learning techniques that find patterns and recognize trends which traditional techniques would have missed, not to mention humans.
→ Explore Avenga Data and Data Science Expertise
Quite a lot of attention is also paid to new technologies enabling real-time data analysis, as well as to on-the-fly algorithms for processing and orchestrating data streams.
Less attention is paid to data quality testing, which, fortunately, is not the case for us.
But relatively little attention is paid to the sources of the data, which are usually transactional systems and the external APIs of your business partners.
Therefore, calling ‘data quality at the source’ a pattern is not an absurdity. The first reaction is usually surprise: how could anyone call it a pattern when it’s just part of every software delivery process? Is it, really?
Data integrity is not only the data team’s responsibility
One of the sad realities of data processing is its dependence on the quality of the source data. The popular phrase “garbage in, garbage out” is a blunt but quite accurate way to describe it in a few words.
Admittedly, it’s not as bad as it was years ago, because the problem of data quality management is now widely recognized and new methods and tools have emerged to help with it.
There are manual, semi-automated, and automated techniques for data verification, data augmentation, and data cleansing. They keep getting better, new patterns are being discovered, and the existing toolkit keeps gaining enhancements that help data people fix more data quality issues.
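As a rough illustration, an automated verification-and-cleansing step might look like the minimal sketch below; the dataset and column names are hypothetical, chosen only to show the idea:

```python
# A minimal sketch of an automated data verification and cleansing step.
# The "email" and "age" columns are hypothetical examples.
import pandas as pd

def verify_and_clean(df: pd.DataFrame) -> pd.DataFrame:
    issues = []

    # Verification: flag records that violate basic expectations.
    if df["email"].isna().any():
        issues.append("missing email addresses")
    if (df["age"] < 0).any():
        issues.append("negative ages")
    if issues:
        print(f"Data quality issues found: {issues}")

    # Cleansing: normalize obvious formatting problems and drop exact duplicates.
    cleaned = df.copy()
    cleaned["email"] = cleaned["email"].str.strip().str.lower()
    return cleaned.drop_duplicates()
```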
Still, one truth doesn’t change: the better the source data quality, the better the entire data processing pipeline, from ingestion mechanisms to data lakes to advanced machine-learning-based analytics.
The data quality of transactional systems is so obvious that it’s… often forgotten and not even measured at this early stage, from the data processing pipeline’s perspective.
Microservices: products vs. data quality
In the microservices architecture, currently the most popular architectural paradigm for transactional applications according to our research and other sources, every service should be responsible for its own data integrity and quality.
Microservice level
So this is the first step: each microservice should ensure that the data it keeps in its stores is consistent and conforms to the defined business rules.
Unfortunately, even this first step is often neglected. It starts with a missing definition of what proper data is, including schemas and business validation rules, and continues into the actual implementation. Even in 2020, I was able to find systems in which the validation of data input was done only in the front-end layer (not ours, I swear!) or was limited to simple “required/non-required” checks or number/date formats, without even checking simple rules at the entity level, such as whether a birth date is not in the future or a person’s age is not a negative number.
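As a minimal sketch of what entity-level validation inside a microservice can look like, here is an example that covers only the rules mentioned above; the Person entity and its fields are hypothetical:

```python
# A minimal sketch of entity-level validation in a microservice.
from dataclasses import dataclass
from datetime import date

@dataclass
class Person:
    name: str
    birth_date: date
    age: int

def validate(person: Person) -> list[str]:
    """Return a list of business-rule violations; an empty list means the entity is valid."""
    errors = []
    if not person.name:
        errors.append("name is required")
    if person.birth_date > date.today():
        errors.append("birth date cannot be in the future")
    if person.age < 0:
        errors.append("age cannot be negative")
    return errors

errors = validate(Person(name="Jane Doe", birth_date=date(2042, 1, 1), age=-3))
if errors:
    # The service rejects the write before the record ever reaches its data store.
    print(f"Rejected: {errors}")
```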
→ Read more about Microservices at the front: micro front-ends
Product level
The next step is to ensure data consistency, and thus quality, at the level of the whole application. In loosely coupled architectures, long-running business transactions replace the ACID integrity of old times; without checking the data integrity of the system as a whole, they result in hard-to-manage inconsistencies.
There will be data duplication (which is bad), stale data, and data conflicts when two different services hold the same business entity, plus the need for conflict resolution, compensation of failed business operations, and so on.
Which data store is the master store for a given type of data? How do we synchronize? How do we perform conflict resolution? How do we test for inconsistencies? Which of them matter to the business and which can be tolerated? How do we add data sanity checks to CI/CD pipelines?
Without answers to those questions, and the answers are not easy to come by, ensuring data quality at the source is not possible.
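For illustration, a cross-service data sanity check that could run as a CI/CD pipeline step might look like the sketch below, assuming one service has been designated as the master store for customer data; the service URLs and endpoints are hypothetical:

```python
# A minimal sketch of a cross-service data sanity check for a CI/CD pipeline.
# Service URLs and the /api/customers endpoints are hypothetical.
import requests

CRM_SERVICE = "https://crm.internal/api/customers"       # designated master store
ORDERS_SERVICE = "https://orders.internal/api/customers"  # holds a copy of the entity

def fetch_ids(url: str) -> set[str]:
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return {record["id"] for record in response.json()}

def check_consistency() -> None:
    master_ids = fetch_ids(CRM_SERVICE)
    replica_ids = fetch_ids(ORDERS_SERVICE)

    stale = replica_ids - master_ids    # records the master no longer knows about
    missing = master_ids - replica_ids  # records that never propagated

    # Fail the pipeline only for inconsistencies the business cannot tolerate.
    if stale:
        raise SystemExit(f"Stale customer records in orders service: {sorted(stale)[:10]}")
    if missing:
        print(f"Warning: {len(missing)} customers not yet propagated (tolerated)")

if __name__ == "__main__":
    check_consistency()
```

The specific check matters less than the fact that someone has decided which inconsistencies should fail the pipeline and which can be tolerated.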
Lean culture
Many engineering teams like to talk about how they love and follow lean principles.
One of the key lean principles in manufacturing is quality at the source: every production site should take care of and be fully responsible for its own quality. This includes testing and quality measurement at the beginning of the manufacturing process, not just at the end. In the case of software, it likewise means taking care of and responsibility for data quality management from the very beginning.
The true spirit of full-cycle development is also about paying attention to data quality at the source and at every later stage of data processing. Application developers cannot say that data quality issues belong to the data team; they are supposed to be part of the solution, from the first line of code they write to the very last.
Wait a second, you’re right, it’s a culture thing, again.
Data mesh is another important trend at the strategic level.
Yes, this article is not primarily about data; it’s about the software development process and culture.
Why not?
Visible cost of data quality “savings” vs. invisible cost of fixing the problems
Paying attention to data quality at the source is, of course, more expensive than ignoring it. The data needs to be described, data validation rules and tests have to be created and maintained, and bug fixes have to be implemented.
It’s a visible cost.
If you skip it, fixing the problems later in the data ingestion pipeline or in the ML model will be very expensive. The connection between that later cost and the direct cost here and now is often hard to see, so without financial numbers there is no true understanding of the consequences.
Cost of losing the trust of the consumers
Imagine that the wrong data is used to make a business decision. Very often this is not just an internal problem for the data consumer; the sometimes forgotten fact is that the key consumers of your data are… your clients!
When they see wrong data in the transactional system or in reports, it may be very late to fix those bugs and certainly too late to preserve the trust relationship with the customer.
→ Explore why Essentially, Data is good. It’s the use cases that can be problematic
Future of data integrity at the source pattern
Of course, step zero is always to increase the awareness of the problem.
What at first seems so obvious that it’s not even worth mentioning often turns out to be underestimated or entirely forgotten.
There are known and proven solutions to this problem. And then there are the rewards: better data analytics, better predictive models, and happy customers.
And what is even better, you don’t have to do it alone! As a group of ambitious experts, Avenga can help you establish effective solutions for data quality, starting with data quality at the source.