Nowadays, data quality metrics are crucial for assessing the overall health of a business. Why? Simply because bad or low-quality data can dramatically impact an organization's productivity and overall ROI.
If you work in the business intelligence and data analysis world, you should know how to measure data quality, which dimensions to use, and what the standards and best practices for measurement are.
On this page:
- What is data quality? Why is it so important for business?
- Key data quality metrics/dimensions to assess and assure high data quality.
- An infographic in PDF for free download.
What is data quality? Why do you need to care about it?
Data quality refers to the overall utility of the data.
To operate productively, businesses need their data to be easily processed and analyzed!
Now, let's see the definition:
Data quality is a measurement of the adequacy and usefulness of a given data set from different perspectives. In the business world, data need to be of high quality in order to serve as a basis for business intelligence and for making business decisions.
Data reliability is a hot topic nowadays. For data to be reliable, it must score highly on a number of dimensions and metrics, including accuracy, consistency, completeness, and timeliness.
Importance and Benefits
Business organizations struggle to manage the constant flood of new technologies and information. Their ability to adapt and get the best out of these areas depends on their ability to get data management right.
When considering the business benefits of high data quality, the key goal is to make your business more profitable and successful.
Let’s see the top benefits your organization can gain from understanding and maintaining high data quality management.
- Better decision making: The more accurate the data, the more confidence managers will have in decisions that lower risk and increase efficiency. Unreliable data lead to less confident decisions and are often the cause of a variety of mistakes.
- More effective marketing: Data accuracy plays a vital role in a number of marketing processes, from marketing research to defining market segments. The wealth of market information available today enables experts to focus more tightly on, and more reliably achieve, their defined goals.
- Enhanced analytical productivity: Data quality metrics and tools can guarantee that only trusted, accurate data is used for decision-making, which in turn increases productivity and confidence in business intelligence and business analysis processes.
- Improved customer satisfaction: Nowadays, customer service is a very data-oriented process with a crucial need for up-to-date, validated information. You know that today's customers expect the best personalized experience possible. The better the quality of your data, the easier it will be for you to understand your customers and give them the personalized approach they want.
- Reduced costs: This point has many dimensions. Thanks to data reliability, your business can complete more projects in less time, maintain better credit control and billing, and so on.
As you can see, high-quality data can benefit businesses across all industries and sectors.
Key data quality metrics and dimensions
To assess and describe the quality of the data in your company, you need specific data quality metrics.
Here are the best practices and dimensions you need for a reliable assessment.
Completeness
As a core data quality metric, completeness determines whether or not each data entry is “full” and complete.
Completeness means you are sure there are no missing records and that no record has missing data parts. All available data entry fields must be filled in. Completeness indicates whether there is enough information to draw conclusions.
When it comes to survey sampling:
- Entirely missing records are known as unit nonresponse.
- Missing items are known as item nonresponse.
Both of these problems indicate a lack of quality.
In many databases (for example insurance and credit databases), missing entire records can have enormous consequences.
What to do when completeness problems occur?
The best practice is to carefully examine the processes that create the database.
Common reasons you may find when examining them are:
- Some employees are not well trained in how to enter data into the software and need additional training.
- The software is hard to use.
- Particular procedures for updating the database are faulty.
What is the completeness unit of measure?
Percentage – the proportion of stored data measured against what “100% complete” would represent. You need to set business rules that define what “100% complete” means.
When managing completeness, it's vital that critical data (such as phone numbers, customer names, and email addresses) are completed first. Non-critical data can be filled in later, as it does not impact completeness as much.
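As a minimal sketch of the percentage measure above (the field names and records here are hypothetical, and the business rule is that every critical field must be filled):

```python
# Hypothetical customer records; None or "" counts as a missing value.
records = [
    {"name": "Ann Lee", "phone": "555-0101", "email": "ann@example.com"},
    {"name": "Bo Chen", "phone": None, "email": "bo@example.com"},
    {"name": "Cara Diaz", "phone": "555-0103", "email": ""},
]

# Business rule: a record is "100% complete" when every critical field is filled.
CRITICAL_FIELDS = ["name", "phone", "email"]

def completeness(records, fields):
    """Percentage of records in which all critical fields are present."""
    complete = sum(
        all(r.get(f) not in (None, "") for f in fields) for r in records
    )
    return 100 * complete / len(records)

print(completeness(records, CRITICAL_FIELDS))  # 1 of 3 records is fully complete
```

In practice the rule defining “100% complete” would come from your own business requirements, not from a fixed field list.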
Accuracy
Do your data accurately represent the “real world” values?
Accuracy is the degree to which data correctly describes the real-world person or thing it identifies.
Accuracy concerns whether the data values stored for a person or object are the correct values. To be accurate, a data value must be the right value and must be represented in an unambiguous form.
Example: the address of a student in the student database should be the student's real address. Incorrectly spelled addresses can impact the analytical process in a very negative way.
It is a simple example, but in a variety of data-driven processes, a high degree of accuracy can make a huge difference. This is especially true over long periods of time, when even small errors can lead to critical inefficiencies. Ensuring data accuracy is not an easy process.
What can affect data accuracy?
Many factors can affect accuracy, so it's important to identify them. Here are the three most common:
- Data Decay. Data can start out accurate but become inaccurate over time. For example, telephone numbers change.
- Manual Entry. Errors from manual entry occur when users enter the wrong value.
- Data Movement. This error appears when moving data from one system to another.
Ideally, accuracy is established through primary research. However, as that is expensive and impractical, the more common approach is to use third-party reference data from trustworthy sources.
What is the accuracy unit of measure?
The percentage of data entries that pass the predefined data accuracy rules.
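A minimal sketch of that measure, assuming a single illustrative accuracy rule (a plausibility check on email syntax; real rules would be defined by the business):

```python
import re

# Hypothetical entries to validate; real rules come from business requirements.
entries = ["ann@example.com", "bo@example", "cara@example.com"]

def passes_rule(email):
    """Illustrative accuracy rule: a syntactically plausible email address."""
    return re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", email) is not None

def accuracy(entries):
    """Percentage of entries that pass the predefined accuracy rule."""
    return 100 * sum(passes_rule(e) for e in entries) / len(entries)

print(accuracy(entries))  # 2 of 3 entries pass the rule
```

Syntax checks like this one catch malformed values, but confirming that a well-formed value is the *correct* one still requires reference data or primary research, as noted above.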
Uniqueness
This is one of the data quality metrics that allows you to ensure no duplicate data is reported.
Uniqueness is a dimension that means you are sure no entity is recorded more than once. Of course, you need to define “the entity” first. The uniqueness metric ensures that each data record is unique and minimizes the risk of acting on outdated information.
You may have two customers in your database, John Peterson and Jonathan Peterson. In fact, they are the same person, but the Jonathan Peterson record has the latest details.
What is the risk in this situation?
The risk is that a sales representative may have outdated information for John Peterson and will therefore be unable to contact the client.
Causes of uniqueness problems include data decay, data movement, and simply outdated information.
What is the uniqueness unit of measure?
Percentage. The number of entities assessed in the “real world” compared to the number of records in the database.
How can you determine the real-world entities?
The most common ways are:
- To determine them from another, more reliable data set.
- To determine them with a relevant external comparator.
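A minimal sketch of the uniqueness percentage, using the John Peterson example above (the record list and the real-world count are hypothetical; in practice the latter would come from a more reliable data set or external comparator):

```python
# Hypothetical records: John Peterson and Jonathan Peterson are the same person.
records = ["John Peterson", "Jonathan Peterson", "Maria Ruiz", "Li Wei"]

# Assumed to come from a more reliable data set: 3 distinct real-world people.
real_world_count = 3

def uniqueness(records, real_world_count):
    """Real-world entities compared to stored records, as a percentage."""
    return 100 * real_world_count / len(records)

print(uniqueness(records, real_world_count))  # below 100: one entity is duplicated
```

A score below 100% signals duplicates like the two Peterson records, which then need to be merged so the latest details survive.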
Timeliness
Some experts consider timeliness the most important data quality dimension, and it is certainly a crucial element of database management and assessment.
Timeliness is the degree to which data represent reality at a given point in time. This metric shows whether the information is available when it is needed.
Timeliness refers to:
- The ability to understand what you need and when.
- Data available without delay.
- Smooth information flow.
You know that data should be available and up to date to adequately meet different management and business decision needs.
There are two core aspects of timeliness:
- Frequency: Data must be available frequently enough to meet the decision making needs. You need to define the appropriate time intervals of collecting data.
- Currency: The data should be current, meaning up to date enough to be beneficial in decision-making.
What is the timeliness unit of measure?
Time. It measures the difference between when data is needed and when it is actually available.
You can also use the following metric:
(The number of records with at least one attribute not updated in time, per time unit) / (the number of records with at least one attribute updated in time).
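Here is a minimal sketch of one possible reading of that ratio, assuming hypothetical per-attribute update timestamps and a business rule that attributes must be refreshed within seven days:

```python
from datetime import datetime, timedelta

# Assumed business rule: every attribute must be refreshed within 7 days.
now = datetime(2024, 1, 10)
max_age = timedelta(days=7)

# Hypothetical last-update timestamps per record attribute.
records = [
    {"phone": datetime(2024, 1, 9), "email": datetime(2024, 1, 8)},   # all in time
    {"phone": datetime(2023, 12, 1), "email": datetime(2024, 1, 9)},  # phone is stale
    {"phone": datetime(2024, 1, 5), "email": datetime(2024, 1, 6)},   # all in time
]

def timeliness_ratio(records, now, max_age):
    """Records with at least one stale attribute over records with
    at least one attribute updated in time."""
    stale = sum(any(now - t > max_age for t in r.values()) for r in records)
    in_time = sum(any(now - t <= max_age for t in r.values()) for r in records)
    return stale / in_time

print(timeliness_ratio(records, now, max_age))  # one stale record out of three
```

The appropriate time unit and maximum age depend on the frequency and currency requirements defined for your own decision-making processes.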
Relevance
Is all the data relevant to your organization's needs and expectations? That question is what the relevance metric is all about.
The data in your organization should be relevant to its needs.
Why is relevance an important dimension?
Because it:
- Helps you to know what you want.
- Gives you the ability to use data with maximum efficiency.
- Saves your time from operating with data you do not need.
Here are two important questions for the relevance dimension:
- Do the data meet the fundamental needs for which they were collected?
- Can the data be used for additional purposes (e.g., a market sizing process)?
What is the relevance unit of measure?
The percentage of entries identified by users as not used.
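A minimal sketch of that measure, assuming a hypothetical list of stored fields and a record of which fields users actually consult:

```python
# Hypothetical stored fields and the subset users actually consult.
stored_fields = ["name", "phone", "email", "fax", "telex"]
fields_used_by_users = {"name", "phone", "email"}

def relevance_unused_pct(stored, used):
    """Percentage of stored entries that users identified as not used."""
    unused = [f for f in stored if f not in used]
    return 100 * len(unused) / len(stored)

print(relevance_unused_pct(stored_fields, fields_used_by_users))  # fax and telex are unused
```

A high percentage suggests you are collecting and maintaining data that does not serve the purposes for which it was gathered.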
The above data quality metrics are the most common in terms of data quality assessment (DQA). However, there are also other dimensions that might be very important in different businesses and organizations:
- Referential Integrity
The importance of high-quality data for all types of business around the world grows constantly. As an expert in the data management field, you need to make sure the data are “Fit for Use” in their decision-making and other roles.
Unfortunately, the data of a huge number of organizations still do not meet data quality metrics and criteria. As the cost of computer storage has decreased over recent years, the number of databases has increased enormously.
You can find more about the business impacts of poor data quality in David Loshin's paper, where he studies the cost classifications related to data quality assessment.
Nowadays, with the wide variety and availability of statistical software and many expert data analysts, there is a good basis for analyzing databases in depth.