4 Data Quality Considerations in Peer Benchmarking Software

Data is the lifeblood of modern businesses, yet high-quality data is scarce, and even harder to replicate once found. In data businesses, quality is one of the most critical operational KPIs and at the forefront of every conversation with internal and external stakeholders.

For peer benchmarking products, which straddle the line between Software-as-a-Service (SaaS) and Data-as-a-Service (DaaS), data quality is paramount. If you’re in the world of peer benchmarking, or just generally looking for a data quality strategy, here are four considerations for building operations that catch as many errors as possible before they reach your customers.

1. What Defines Data Quality?

Most quality control procedures are designed to identify, isolate, and correct outliers. Without diving too deeply into the statistical definition, an outlier is any data point that doesn’t fit the standard pattern of the data. Errors, on the other hand, are data points that are incorrect, whether due to a system fault or user input (we all make typos).

Errors may be outliers, but not all outliers are errors. 

Whether you’re in a SaaS model where your data is generated from user-supplied content or product metadata, or in a DaaS model where your data is the product and created for commercial purposes, understanding these three basic levels of data quality review is critical.

Formats

This is a fundamental data quality target in any business. When operating scaled ETL processes, it is critical that data is formatted correctly and matches your product needs. If a particular format is required for calculation, only that data type should pass the quality control test. 

This target can and should be handled in the software or ingest layer of your products. It is also generally an easy data quality decision: data either matches the required format or it is an error.
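For illustration, a minimal format check at the ingest layer might look like the sketch below; the field names and parsing rules are hypothetical examples, not a prescribed schema.

```python
# Minimal sketch of an ingest-layer format test: a value either parses as the
# required numeric type or it is treated as an error. Field names are hypothetical.
def parse_metric(raw_value) -> float | None:
    """Return the value as a float, or None if it fails the format test."""
    try:
        return float(str(raw_value).replace(",", "").strip())
    except ValueError:
        return None

def ingest_record(record: dict) -> tuple[dict, list[str]]:
    """Split a raw record into clean fields and format errors."""
    clean, errors = {}, []
    for field in ("revenue", "headcount"):      # hypothetical required numeric fields
        value = parse_metric(record.get(field, ""))
        if value is None:
            errors.append(f"{field}: not a valid number ({record.get(field)!r})")
        else:
            clean[field] = value
    return clean, errors

clean, errors = ingest_record({"revenue": "1,250,000", "headcount": "forty"})
print(clean)    # {'revenue': 1250000.0}
print(errors)   # ["headcount: not a valid number ('forty')"]
```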

Deltas

For benchmarking products like Baromitr, where the same data is produced over a series of time-based intervals, delta analysis is key to identifying data quality problems. This type of quality target compares the same data point over time and isolates large swings in value for quality review.

Delta analysis relies heavily on accuracy in the previously submitted time period(s). If you’re building processes/products and cannot be confident that the data from the first several periods will be accurate, do not rely on delta targets.
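As an illustration, a delta target can be as simple as the sketch below, which compares each period to the one before it and flags large swings for review; the 50% threshold and period labels are arbitrary examples.

```python
# A small sketch of a delta target: flag any period whose value swings more than
# max_pct_change against the prior period. Threshold and labels are illustrative.
def flag_deltas(series: dict[str, float], max_pct_change: float = 0.5) -> list[str]:
    flagged = []
    periods = sorted(series)                     # assumes sortable labels like '2024-01'
    for prev, curr in zip(periods, periods[1:]):
        prior = series[prev]
        if prior == 0:
            continue                             # avoid division by zero; handle separately
        swing = abs(series[curr] - prior) / abs(prior)
        if swing > max_pct_change:
            flagged.append(curr)
    return flagged

print(flag_deltas({"2024-01": 100.0, "2024-02": 104.0, "2024-03": 310.0}))
# ['2024-03'] -> large swing, route to quality review
```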

Distributions 

Distribution targets measure the spread (or dispersion) of data across multiple records collected within the same time interval. These quality targets are the most common in data businesses, and rely on established statistical principles to find and isolate quality control problems. Different processes rely on different measures of spread, but most quality analysis starts with one of these two measures.

[Figure: measures of spread. Attribution: Diva Jain / CC BY-SA (https://creativecommons.org/licenses/by-sa/4.0)]

Median/Quartiles: the median is the middle point of a data distribution, so that 50% of the data is greater than the median and 50% is less. The quartiles split those halves again, marking the points above which 75% of the data lies (the first quartile) and above which 25% lies (the third quartile). Medians minimize the impact of outliers on analytics, but are more limited in how far they extend into additional analysis.

Use Median and Quartile targets if data outliers are typically valid data points, expected, and significant to your data product. 
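One common way to turn quartiles into a review target is the interquartile-range (IQR) rule sketched below; the 1.5× multiplier is the conventional default, used here only as an illustration.

```python
# Quartile-based (IQR) review target: values outside Q1 - 1.5*IQR or Q3 + 1.5*IQR
# are routed to review rather than automatically corrected.
import statistics

def iqr_review(values: list[float], k: float = 1.5) -> list[float]:
    q1, _, q3 = statistics.quantiles(values, n=4)   # quartile cut points
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]

data = [12, 14, 13, 15, 14, 13, 95]
print(iqr_review(data))   # [95] -> isolated for review, but may still be a valid outlier
```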

Mean/Standard Deviations: the mean is the average of all data points in your dataset. The standard deviation is a statistical measure of spread, and data points are usually judged by their distance from the mean (±1, 2, or 3 standard deviations). Means can be heavily skewed by outliers, but are required for most other types of statistical analysis.

Use Mean and Standard Deviation targets if data outliers are typically real errors and if more advanced statistical review is required by your product.
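A mean-and-standard-deviation target typically reduces to a z-score cut like the sketch below. The ±2 standard deviation threshold is illustrative; on small samples a single extreme value can inflate the standard deviation enough to hide itself, so the threshold needs tuning to your data.

```python
# Mean/standard-deviation target: treat anything more than `threshold` standard
# deviations from the mean as a probable error. The threshold is illustrative.
import statistics

def zscore_errors(values: list[float], threshold: float = 2.0) -> list[float]:
    mean = statistics.fmean(values)
    stdev = statistics.stdev(values)
    if stdev == 0:
        return []
    return [v for v in values if abs(v - mean) / stdev > threshold]

data = [10, 11, 9, 10, 12, 10, 11, 200]
print(zscore_errors(data))   # [200] -> likely a real input error
```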

2. Where to Assess Data Quality

Data businesses rely on data pipelines: the mechanisms through which data flows across your business. As on a manufacturing production line, data quality tests should be instituted at every stage of the pipeline to ensure the finished data product is of the highest quality.

But where exactly should these be built in a benchmarking solution?

The first place to add data quality controls is at the beginning of your pipeline, starting with raw user input. This is the most effective way to ensure high-quality data flows through your pipeline, but it has the greatest aggregate impact on data production deadlines. It’s also the least likely to allow real errors to reach your product.

Unstructured input is the bane of data quality. If you can, build an ingest mechanism which leverages a standardized taxonomy or ontology to minimize data variance. If you don’t have an established taxonomy yet, you can use a service like Classr to build one yourself and integrate it directly into your pipeline.
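As a rough sketch of the idea (not Classr’s API), the snippet below maps free-text input onto a small controlled vocabulary and queues anything it can’t match for review; the taxonomy and field are hypothetical.

```python
# Illustrative sketch: constrain free-text input to a controlled vocabulary at
# ingest. The taxonomy is hypothetical; a real setup would load it from your
# taxonomy service rather than hard-coding it.
import difflib

INDUSTRY_TAXONOMY = ["manufacturing", "healthcare", "financial services", "retail"]

def normalize_industry(raw: str) -> str | None:
    """Map raw input onto the taxonomy, or return None so it can be queued for review."""
    candidate = raw.strip().lower()
    if candidate in INDUSTRY_TAXONOMY:
        return candidate
    # Tolerate small typos with a fuzzy match rather than accepting free text.
    matches = difflib.get_close_matches(candidate, INDUSTRY_TAXONOMY, n=1, cutoff=0.8)
    return matches[0] if matches else None

print(normalize_industry("Healthcare "))    # 'healthcare'
print(normalize_industry("helthcare"))      # 'healthcare' (fuzzy match)
print(normalize_industry("space mining"))   # None -> route to manual review
```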

Another standard location for data quality controls lies in the pre-production review stage. For peer benchmarking solutions like Baromitr, this is handled at the data submission and review stage. Quality control at this stage is ideal, but it is also the point where data errors are most likely to be overlooked by users. 

[Screenshot: Baromitr’s pre-production review for data quality from a Member’s perspective.]

The most commonly used location for data quality tests is post-production review. This is the final stopping point for data before it is presented to your customers. Although it is generally the least effective checkpoint in a scaled business, it is the easiest to implement and has the lowest impact on overall data deadlines. Unfortunately, it is also the testing point at which the most errors will slip through into your final product.
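To make the placement concrete, the sketch below wires simplified versions of the three checkpoints into a single run; the field names, thresholds, and stage logic are illustrative stand-ins, not Baromitr’s actual pipeline.

```python
# Hedged sketch of the three checkpoints in one pipeline run. Stage logic is
# deliberately simplified; thresholds and field names are hypothetical.
import statistics

def run_quality_pipeline(submissions: list[dict], prior_values: dict[str, float]) -> dict:
    accepted, ingest_errors, review_queue = [], [], []

    for sub in submissions:
        # 1) Ingest: format test - the metric must parse as a number.
        try:
            value = float(sub["metric"])
        except (KeyError, TypeError, ValueError):
            ingest_errors.append(sub)
            continue

        # 2) Pre-production: delta target against the member's prior period.
        prior = prior_values.get(sub.get("member_id"), 0.0)
        if prior and abs(value - prior) / abs(prior) > 0.5:
            review_queue.append(sub)
            continue

        accepted.append(value)

    # 3) Post-production: final distribution sweep before publication.
    post_flags = []
    if len(accepted) >= 2 and statistics.stdev(accepted) > 0:
        mean, stdev = statistics.fmean(accepted), statistics.stdev(accepted)
        post_flags = [v for v in accepted if abs(v - mean) / stdev > 2.0]

    return {"accepted": accepted, "ingest_errors": ingest_errors,
            "review_queue": review_queue, "post_flags": post_flags}
```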

3. Who Should Test My Data Quality?

The question of where to place data quality controls is also dictated by who, or what, will be running these tests. Many businesses rely on a mix of manual and automated mechanisms, each of which has its own pros and cons.

Manual quality control is much faster to kick off: it simply requires a person with access to the data and the capacity to determine whether a data point is an error. Unfortunately, manual tests are far less efficient at scale, and they require significant, continually growing domain expertise on the part of the quality control team.

If your business suffers from high turnover or rapidly changing data requirements, reliance on manual quality control poses a significant risk. 

Automated quality control, built on machine learning and AI techniques, is much more expensive and time-consuming to design and execute. Over time, however, it is much faster than manual quality control, far more scalable, and requires little ongoing domain expertise. This strategy is all about ROI: expensive and difficult to build, but it regularly outperforms manual methods as time goes on.
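One widely used automated technique is an isolation forest, sketched below with scikit-learn on synthetic data; this illustrates the machine-learning route in general, not how Baromitr’s automated checks are built. The ROI trade-off shows up clearly: the model costs effort to train and tune, but scoring each new record is cheap.

```python
# Automated outlier flagging with scikit-learn's IsolationForest on synthetic data.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(seed=0)
normal = rng.normal(loc=100, scale=10, size=(200, 1))   # typical submissions
bad = np.array([[400.0], [-50.0]])                      # injected bad records
data = np.vstack([normal, bad])

model = IsolationForest(contamination=0.01, random_state=0)
labels = model.fit_predict(data)            # -1 = flagged for review, 1 = looks normal
print(sorted(data[labels == -1].ravel()))   # the injected records should be flagged
```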

4. How Much Quality Testing Should I Do?

This is the most difficult question to answer, and likely the one where you and your stakeholders will spend the most time. In a world with unlimited resources and time, every recorded data point would be perfect, with no question of accuracy. Unfortunately, real businesses have to make trade-offs.

If you’re benchmarking in areas of high sensitivity or scrutiny, such as government, financial, and health data, you have every incentive to ensure the highest level of data quality. That quality will come with costs in time or financial outlay, so choose the level of data scrutiny that catches as much as possible within your time and budgetary constraints. In other sectors, data quality processes must be balanced against business requirements: if your business operates on a monthly production cycle, standard quality control measures should not push data delivery past that cycle.


If you’re in the process of building a benchmarking product, or you’re here looking for a general framework for data quality in DaaS and SaaS products, these four considerations are a strong starting point for establishing your own data quality control process.

If you’d like to get started building a high-quality data benchmarking solution but aren’t sure how to start, or if you’re interested in learning more about how we measure data quality in Baromitr, feel free to contact us.

Sam Giffin
Baromitr CDO and Co-Founder
