Good data in, great decisions out.


Artificial Intelligence and data governance

Artificial intelligence is spreading rapidly across businesses. The technology drives steady improvements in decision-making and overall performance in a wide variety of industries. It also helps organizations understand customer needs, improve service quality, and predict and prevent risks.
A proper data governance framework is essential for organizations that want to unlock the full potential of their data. This post explains what data governance is and why it is relevant to artificial intelligence.

Data governance is the set of procedures designed to manage data properly. Appropriate policies must guarantee the availability, usability, integrity, and security of enterprise data. In machine learning, data governance procedures ensure that all interested stakeholders across the enterprise always have access to high-quality data.

Scales of data quality

In machine learning, just as in computer science in general, the saying “garbage in, garbage out” holds true: even the most advanced machine learning model will perform poorly when fed low-quality data. So how can data quality be assessed before the data is actually used? A data quality assessment process starts by defining a list of data dimensions. Data dimensions are features of the original data that can be measured against pre-defined standards. Some of the most common data dimensions are:

Accuracy. It measures how reliable a dataset is by comparing it against a known, trustworthy reference dataset. It refers to a single data field and usually relates to the number of outliers caused by database failures, sensor malfunctions, wrong data collection strategies, and so on.

Timeliness. It is the time delay from data generation and acquisition to utilization. Data that is used long after it was collected might be obsolete or might no longer reflect the physical phenomenon it describes.

Completeness. It refers to the percentage of available data or, equivalently, the absence of missing values.

Consistency. Data is consistent when the same data located in different storage areas can be considered equivalent (where "equivalent" can mean anything from a perfect match to semantic similarity).

Integrity. High-integrity data conforms to the syntax (format, type, range) of its definition, as provided by, for example, a data model.

For a more detailed and comprehensive discussion about data dimensions, refer to [1] and [2].
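
As a rough illustration of how some of these dimensions can be turned into numbers, the sketch below computes completeness, integrity, and timeliness for a handful of records. The records, the field names, the email pattern, the age range, and the 30-day freshness window are all assumptions made for this example; they are not part of any standard.

```python
import re
from datetime import datetime, timedelta

# Hypothetical records; None marks a missing value.
records = [
    {"email": "ana@example.com", "age": 34, "collected_at": datetime(2024, 1, 10)},
    {"email": None, "age": 29, "collected_at": datetime(2024, 1, 12)},
    {"email": "not-an-email", "age": 210, "collected_at": datetime(2023, 6, 1)},
    {"email": "li@example.org", "age": None, "collected_at": datetime(2024, 1, 14)},
]

# Deliberately simple pattern, used here only as an example format rule.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def completeness(rows, field):
    """Fraction of rows in which `field` is present (not None)."""
    return sum(row[field] is not None for row in rows) / len(rows)


def integrity(rows, field, is_valid):
    """Fraction of non-missing values that conform to the field's definition."""
    values = [row[field] for row in rows if row[field] is not None]
    if not values:
        return 0.0
    return sum(1 for value in values if is_valid(value)) / len(values)


def timeliness(rows, now, max_age):
    """Fraction of rows collected no longer than `max_age` before `now`."""
    return sum(now - row["collected_at"] <= max_age for row in rows) / len(rows)


email_completeness = completeness(records, "email")
email_integrity = integrity(records, "email", lambda v: bool(EMAIL_PATTERN.match(v)))
age_integrity = integrity(records, "age", lambda v: 0 <= v <= 120)
freshness = timeliness(records, datetime(2024, 1, 15), timedelta(days=30))
```

Accuracy and consistency are left out on purpose: both need a second source to compare against (a trusted reference dataset or another storage area), which this small example does not have.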

Measuring the right things

The dimensions to monitor vary depending on business requirements, processes, users, and so on. For example, in social media data, timeliness and accuracy are probably the most important quality features. However, since social media data is usually unstructured, consistency and integrity might not be suitable for evaluation. For biological data, by contrast, storage software and data formats are very heterogeneous, so consistency might not be the most appropriate quality dimension.

Once the data and the dimensions to be monitored have been selected, it is important to define a baseline of values or ranges that represent good and bad quality data, that is, the quality rules the data will be assessed against. Each dimension also has its own weighting, which determines how much it contributes to data quality as a whole. How rigorous the rules need to be, and how to weight each data dimension, depends largely on the importance that the individual organization places on the monitoring phase.

For instance, most would agree that incorrect or missing email addresses have a significant impact on marketing campaigns. In this case, one would set a very low threshold on the tolerated number of missing records and a high weighting on completeness and accuracy. The low threshold limits how many missing emails are accepted, while the high weighting ensures that the availability and reliability of the existing records dominate the overall quality score.
The same applies to inaccurate personal details, which may lead to missed sales opportunities or to an increase in customer complaints.
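
To make the idea of rules and weightings concrete, here is a minimal sketch that combines per-dimension scores into a weighted overall score and lists the rules that are violated. The scores, the weights, and the minimum values are invented for the email example above and would need to be agreed within each organization.

```python
# Hypothetical per-dimension scores in [0, 1], e.g. produced by an assessment step.
scores = {"completeness": 0.97, "accuracy": 0.97, "timeliness": 0.80}

# Hypothetical weights: completeness and accuracy matter most for email campaigns.
weights = {"completeness": 0.4, "accuracy": 0.4, "timeliness": 0.2}

# Hypothetical quality rules: the minimum acceptable score per dimension.
# A minimum completeness of 0.99 means at most 1% missing records are tolerated.
minimums = {"completeness": 0.99, "accuracy": 0.95, "timeliness": 0.50}


def overall_quality(scores, weights):
    """Weighted average of the dimension scores."""
    total_weight = sum(weights.values())
    return sum(scores[dim] * weights[dim] for dim in weights) / total_weight


def violated_rules(scores, minimums):
    """Dimensions whose score falls below the agreed minimum."""
    return sorted(dim for dim, floor in minimums.items() if scores[dim] < floor)


quality = overall_quality(scores, weights)
violations = violated_rules(scores, minimums)
```

With these assumed numbers the overall score looks healthy, yet the completeness rule is still violated. That is one reason to keep per-dimension rules alongside the aggregate score rather than relying on the aggregate alone.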

Keeping data quality high

Once this preparation phase has been completed, the process enters the data acquisition phase, followed by the data monitoring stage. The latter consists mainly of a data quality assessment process and an issue resolution process.

Figure: Schema of Data Quality Assessment phases

The Data Quality Assessment process can either produce a data quality report at regular intervals or record the quality scores continuously. Stored in a database, these scores help track data quality over time.
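
One simple way to keep such scores is a single table with one row per measurement. The sketch below uses SQLite from the Python standard library; the table name, the column names, and the sample values are assumptions made for illustration only.

```python
import sqlite3
from datetime import date


def open_store(path=":memory:"):
    """Open the score store and create the (hypothetical) table if needed."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS quality_scores ("
        "measured_on TEXT NOT NULL, dataset TEXT NOT NULL, "
        "dimension TEXT NOT NULL, score REAL NOT NULL)"
    )
    return conn


def record_score(conn, measured_on, dataset, dimension, score):
    """Append one quality measurement."""
    conn.execute(
        "INSERT INTO quality_scores VALUES (?, ?, ?, ?)",
        (measured_on.isoformat(), dataset, dimension, score),
    )
    conn.commit()


def score_history(conn, dataset, dimension):
    """Return (date, score) pairs for one dimension, oldest first."""
    cursor = conn.execute(
        "SELECT measured_on, score FROM quality_scores "
        "WHERE dataset = ? AND dimension = ? ORDER BY measured_on",
        (dataset, dimension),
    )
    return cursor.fetchall()


conn = open_store()
record_score(conn, date(2024, 1, 8), "customers", "completeness", 0.91)
record_score(conn, date(2024, 1, 1), "customers", "completeness", 0.95)
record_score(conn, date(2024, 1, 8), "customers", "accuracy", 0.88)
history = score_history(conn, "customers", "completeness")
```

Dates are stored as ISO 8601 strings, so sorting them as text also sorts them chronologically. Here `history` lists the two completeness measurements oldest first, which makes the drop from 0.95 to 0.91 easy to spot.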

Monitoring and solving issues

The issue resolution process enables either people or automated software tools to flag issues and to investigate and resolve them systematically. The more informative the issue logs are, the more efficiently data quality problems can be resolved. As suggested in [3], certain information should always be included in every log.

Each issue should have a unique identifier. Sequential numbers are good identifiers because they also show how many issues have been identified so far. For statistical purposes, it is useful to group issues into categories and to record the dates on which problems were opened and fixed. The dates make it possible to compute the average issue resolution time and compare it with a target. Logging the person who raised an issue makes it easier to report progress and agree on action plans.

Logs must also include the data owners, who are responsible for investigating and fixing issues related to the data they own. An informative log helps estimate the impact of a problem and prioritise resolution efforts correctly.
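
The log fields described above can be sketched as a small data structure. The class name, the field names, the sample issues, and the 5-day target below are hypothetical; the sketch only shows how sequential identifiers, categories, dates, the person raising the issue, and the data owner fit together.

```python
from collections import Counter
from dataclasses import dataclass, field
from datetime import date
from itertools import count
from typing import Optional

# Sequential identifiers: the latest id also tells how many issues were logged.
_next_id = count(1)


@dataclass
class Issue:
    category: str
    raised_by: str
    data_owner: str
    opened_on: date
    fixed_on: Optional[date] = None  # None while the issue is still open
    issue_id: int = field(default_factory=lambda: next(_next_id))


def average_resolution_days(issues):
    """Mean number of days between opening and fixing, over fixed issues only."""
    durations = [
        (issue.fixed_on - issue.opened_on).days
        for issue in issues
        if issue.fixed_on is not None
    ]
    return sum(durations) / len(durations) if durations else None


issues = [
    Issue("missing values", "maria", "crm-team", date(2024, 1, 2), date(2024, 1, 6)),
    Issue("wrong format", "tom", "billing-team", date(2024, 1, 3), date(2024, 1, 5)),
    Issue("missing values", "maria", "crm-team", date(2024, 1, 9)),
]

TARGET_DAYS = 5  # hypothetical resolution target
average_days = average_resolution_days(issues)
within_target = average_days is not None and average_days <= TARGET_DAYS
issues_per_category = Counter(issue.category for issue in issues)
open_issues_per_owner = Counter(
    issue.data_owner for issue in issues if issue.fixed_on is None
)
```

Counting open issues per data owner is one possible starting point for prioritising: it shows at a glance who has outstanding work on the data they own.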

In summary

Data is appropriate for consumption by third parties and for building machine learning models only after it has passed quality control without significant problems.

At Amethix we dedicate a large part of our time to assessing data quality. We adopt a continuous approach: from the early stages of data collection, cleaning and transformation up to data integration and model design, we monitor data closely. This "whole pipeline" approach speeds up the development and debugging of our models and helps them perform at their best. Data scientists who follow this strategy know where to look when a model is not performing as expected.

Of course, the real value of data lies in the support it gives to the decision processes of an organisation. Any enterprise aiming to adopt artificial intelligence in its processes should implement a data governance framework to ensure the quality of its data.

Addressing data governance is essential for your organisation, because good data leads to great decisions.

References

[1] The Challenges of Data Quality and Data Quality Assessment in the Big Data Era 
[2] The six primary dimensions for data quality assessment
[3] What do you include in a data quality issue log
