The adoption of artificial intelligence is rapidly spreading across many businesses. This disruptive technology is driving consistent improvements in operational efficiency and decision-making across a wide variety of industries. It is helping organizations to better understand customer needs, improve service quality, and predict and prevent risks, to mention just a few benefits.
In this context, a proper data governance framework becomes fundamental for organizations that want to fully unlock the potential of their data. Generally speaking, data governance is the set of procedures for managing the availability, usability, integrity, and security of the data used in an enterprise. For machine learning specifically, data governance procedures ensure that high-quality data are available to all the stakeholders across the enterprise, and that the purpose of such access is always clear.
In machine learning, as in computer science more broadly, the rule is “garbage in, garbage out”. This means that even the most sophisticated machine learning model will perform poorly if it is fed low-quality data. So, how can one assess data quality before the data is actually used? A generic data quality assessment process starts by defining a list of data dimensions, which are simply characteristics of the original data that can be measured against pre-defined baseline standards. Here is a summary of some of the most common ones:
• Accuracy. It measures how reliable a dataset is by comparing it against a known, trustworthy reference data set. It refers to a single data field and is usually related to the number of outliers caused by database failures, sensor malfunctions, wrong data collection strategies, and so on
• Timeliness. It is defined as the time delay from data generation and acquisition to utilization. Data that is used long after it was collected might be obsolete or no longer reflect the physical phenomenon it describes
• Completeness. It refers to the percentage of available data or equivalently, the absence of missing values
• Consistency. It usually means that the same data that are located in different storage areas should be considered equivalent, where equivalent can have several meanings from perfect match to semantic similarity
• Integrity. It means that data conforms to the syntax (format, type, range) of its definition provided by e.g. a data model
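Some of the dimensions above can be turned into simple numeric measurements. The following is a minimal sketch in Python, not a definitive implementation: the record layout, the field names (`email`, `age`, `collected_at`) and the function names are assumptions made purely for illustration, and the accepted age range is an arbitrary example of a baseline.

```python
from datetime import datetime, timedelta

# Hypothetical records; the field names are illustrative only.
records = [
    {"email": "a@example.com", "age": 34, "collected_at": datetime(2024, 1, 1, 12, 0)},
    {"email": None, "age": 29, "collected_at": datetime(2024, 1, 1, 12, 5)},
    {"email": "c@example.com", "age": -5, "collected_at": datetime(2024, 1, 1, 12, 10)},
    {"email": "d@example.com", "age": 51, "collected_at": datetime(2024, 1, 1, 12, 15)},
]


def completeness(rows, field):
    """Fraction of rows in which `field` is present and not None."""
    if not rows:
        return 0.0
    present = sum(1 for r in rows if r.get(field) is not None)
    return present / len(rows)


def integrity(rows, field, expected_type, lo, hi):
    """Fraction of rows whose `field` has the expected type and lies in [lo, hi]."""
    if not rows:
        return 0.0
    valid = sum(
        1
        for r in rows
        if isinstance(r.get(field), expected_type) and lo <= r[field] <= hi
    )
    return valid / len(rows)


def timeliness(rows, field, used_at):
    """Largest delay between the collection time in `field` and the time of use."""
    return max(used_at - r[field] for r in rows)


email_completeness = completeness(records, "email")
age_integrity = integrity(records, "age", int, 0, 120)
max_delay = timeliness(records, "collected_at", datetime(2024, 1, 1, 13, 0))
```

Accuracy and consistency are left out of the sketch on purpose, since both need something to compare against (a trusted reference data set, or a second copy of the same data) and a definition of equivalence that depends on the business.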
For a more detailed and comprehensive discussion about data dimensions, refer to “The Challenges of Data Quality and Data Quality Assessment in the Big Data Era” and “The six primary dimensions for data quality assessment”. It goes without saying that the dimensions that are usually monitored can vary depending on business requirements, processes, users, etc.
For example, for social media data, timeliness and accuracy are probably the most important quality features. However, since social media data are usually unstructured, consistency and integrity might not be suitable for evaluation. For biological data instead, data storage software and data formats are very heterogeneous. Thus, consistency might not be the most appropriate metric as a quality dimension.
Once the data and the dimensions to be monitored have been selected, it is important to define a baseline of values or ranges representing good and bad quality data, that is, the quality rules the data needs to be assessed against. Moreover, each dimension is given a weight that determines how much it contributes to data quality as a whole. How rigorous the rules need to be, and how to choose the weight for each data dimension, depends largely on the importance that each organization places on the monitoring phase.
For instance, one may easily agree that incorrect or missing email addresses would have a significant impact on marketing campaigns. In this case, one would set a very low threshold on the tolerated number of missing records and a high weight on completeness and accuracy. The low threshold would minimize the number of missing emails, while the high weight would make sure that the availability and reliability of the existing records dominate the overall quality score.
The same applies to inaccurate personal details that may lead to missed sales opportunities or to an increase in complaints from customers.
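The combination of per-dimension rules and weights described above can be sketched as follows. This is only one possible formulation, assuming each dimension has already been scored between 0 and 1 and that the overall score is a weighted average. The function names, the scores, the weights and the thresholds are all hypothetical. Note that a low tolerance for missing records translates here into a high minimum completeness score.

```python
def overall_quality(scores, weights):
    """Weighted average of per-dimension scores (each assumed to be in [0, 1])."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(scores[dim] * weights[dim] for dim in weights) / total_weight


def failed_rules(scores, minimums):
    """Dimensions whose score falls below the minimum required by its rule."""
    return [dim for dim, minimum in minimums.items() if scores[dim] < minimum]


# Hypothetical values for a marketing email data set.
scores = {"completeness": 0.9, "accuracy": 0.8, "timeliness": 0.5}
weights = {"completeness": 4, "accuracy": 4, "timeliness": 2}
minimums = {"completeness": 0.95, "accuracy": 0.75}

quality = overall_quality(scores, weights)
violations = failed_rules(scores, minimums)
```

With these illustrative numbers, completeness and accuracy together account for 80% of the overall score, and the completeness rule is the one that fails.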
Once this preparation phase has been completed, the process enters the data acquisition phase, followed by the data monitoring stage. The latter consists mainly of a data quality assessment process and an issue resolution process.
In the Data Quality Assessment process, either a data quality report is generated at regular intervals or a continuous recording of the data quality score may be produced and then stored in a database. The latter strategy would help in tracking data quality over time.
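A continuous recording of the quality score could look like the sketch below, which uses Python's built-in `sqlite3` module. The table name `quality_scores`, its columns and the helper functions are assumptions for illustration; a production setup would likely use whatever database the organization already runs.

```python
import sqlite3
from datetime import datetime, timezone


def open_store(path=":memory:"):
    """Open (or create) the score store; defaults to an in-memory database."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS quality_scores ("
        "recorded_at TEXT NOT NULL, dataset TEXT NOT NULL, score REAL NOT NULL)"
    )
    return conn


def record_score(conn, dataset, score, recorded_at=None):
    """Append one quality score, timestamped in UTC unless a time is given."""
    recorded_at = recorded_at or datetime.now(timezone.utc)
    conn.execute(
        "INSERT INTO quality_scores VALUES (?, ?, ?)",
        (recorded_at.isoformat(), dataset, score),
    )
    conn.commit()


def score_history(conn, dataset):
    """Return (timestamp, score) pairs for one data set, oldest first."""
    cursor = conn.execute(
        "SELECT recorded_at, score FROM quality_scores "
        "WHERE dataset = ? ORDER BY recorded_at",
        (dataset,),
    )
    return cursor.fetchall()


conn = open_store()
record_score(conn, "customers", 0.9, datetime(2024, 1, 2, tzinfo=timezone.utc))
record_score(conn, "customers", 0.8, datetime(2024, 1, 1, tzinfo=timezone.utc))
history = score_history(conn, "customers")
```

Storing timestamps as ISO 8601 strings in a single time zone keeps them sortable as plain text, which is what makes the `ORDER BY` clause return the scores in chronological order.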
The issue resolution process enables either people or automated software tools to flag issues in an issue log and to systematically investigate and resolve them. Of course, the more informative such logs are, the more efficient the resolution of the data quality problems will be. As suggested in the article “What do you include in a data quality issue log”, regardless of the business, certain information should be included in all logs. First of all, each issue should have a unique identifier. Using sequential numbers as identifiers has the additional advantage of providing an instant picture of how many issues have been identified so far. The date when an issue was raised, the date when it was fixed, and a categorization of each issue are all important because they make it possible to compute statistics such as the average issue resolution time and to compare it with a target resolution time, to mention just one. Logging the person who raised an issue is also necessary in order to know whom to report progress to and with whom to agree on remedial action plans. Data owners, who are responsible for investigating and fixing issues related to the data they own, are also included in such logs.
The estimated impact of the issue within the organization represents a critical element too because it allows prioritizing the efforts required for investigation and resolution.
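The log fields listed above map naturally onto a small record type. The sketch below is one hypothetical way to model it in Python. The class name, the field names and the integer impact scale (higher meaning more severe) are assumptions, not something prescribed by the cited article.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from itertools import count
from typing import Optional

# Sequential identifiers: the latest id shows how many issues have been raised.
_next_id = count(1)


@dataclass
class Issue:
    category: str
    raised_by: str            # person to report progress to
    data_owner: str           # person responsible for investigating and fixing
    impact: int               # hypothetical scale: higher means more severe
    raised_on: date
    fixed_on: Optional[date] = None
    issue_id: int = field(default_factory=lambda: next(_next_id))


def average_resolution_time(issues):
    """Mean time between raising and fixing, over resolved issues only."""
    durations = [i.fixed_on - i.raised_on for i in issues if i.fixed_on is not None]
    if not durations:
        return None
    return sum(durations, timedelta()) / len(durations)


def open_issues_by_impact(issues):
    """Unresolved issues, most severe first, to prioritize investigation."""
    unresolved = (i for i in issues if i.fixed_on is None)
    return sorted(unresolved, key=lambda i: i.impact, reverse=True)
```

The result of `average_resolution_time` can then be compared with a target resolution time, for example `average_resolution_time(log) <= timedelta(days=5)`, where the five-day target is again only an illustrative value.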
Only when data quality is assessed as good and no significant issues are detected can data finally be consumed by third parties and considered for building machine learning models that solve specific business problems.
At Amethix we dedicate a large part of the typical machine learning pipeline to assessing data quality. Our monitoring strategy takes place during the early phases of the pipeline, such as data collection, cleaning, and transformation, and is a prerequisite for subsequent phases like data integration and model design. This not only encourages data scientists to consider exclusively high-quality data for their models but also speeds up both the development and debugging of the entire machine learning pipeline. As a matter of fact, the data scientists following our strategy know exactly what needs to be improved when a model is not performing as expected.
It goes without saying that the real value of data is connected to how much support it provides to the decision processes of an organization. Any enterprise aiming to adopt artificial intelligence for its processes must implement a data governance framework that monitors and improves the quality of its data. Because good data always leads to great decisions.
The Challenges of Data Quality and Data Quality Assessment in the Big Data Era
The six primary dimensions for data quality assessment
What do you include in a data quality issue log