Artificial intelligence (AI) continues its rise to prominence within the business world. The number of companies using AI today and the range of problems AI is being applied to are both increasing steadily. However, there is one issue that is plaguing AI just as much as it has plagued analytics of all kinds over the years—data quality.
Data Quality Is A Familiar Nemesis
Organizations put tremendous resources behind ensuring the quality of their data. This is necessary due to the broad range of ways that data quality can be compromised. Users might input data incorrectly, a system setting might lead to an incorrect code being assigned to certain actions, or a typo might end up in a script developed to facilitate data transformation. These are among the many potential sources of poor data quality.
The reality is that the data quality issue will never be “solved” no matter how much an organization budgets or how sincere its intentions may be. This is because the business environment—and the systems supporting it—is always in flux. New quality issues can arise at any time from any number of directions, including immediately after we certify that the data is pristine at a given point in time.
AI Doesn’t Escape Data Quality Issues
As organizations delve into AI, data quality will be as big an issue as ever. Any AI process that uses traditional data sources depends on the quality of those sources just as much as any other analytic process does. However, AI also makes use of a wide range of new data sources and data types. The established data quality methods that are typically carried over to a new data source fall apart when applied to the new data types used by AI.
Data such as images, text and videos were rarely analyzed to any significant degree before AI methodologies became available. The data quality issues with these data types are also different from those of the past. Let’s take the example of images and consider a few ways that different data quality issues come into play.
• Data quality for model building. On the input side, images are often “tagged” to facilitate building a model. For example, a picture will be tagged as “containing a cat” or not, “containing a hot dog” or not, etc. Humans do this tagging, and humans can make errors. How do we find those errors and correct them? It can be easy to automate flagging that a price is clearly too low, an invoice is too high or an age can’t be true. With image tagging, it is very difficult to find an error without having a second person look for it and correct it. Detecting tagging errors mathematically is incredibly difficult, if not impossible, today.
• Data quality for model scoring. Assuming there is a good model to identify hot dogs or cats, then new images can be passed to the model to determine if the pictures contain a hot dog or a cat. However, how do we identify if a picture is too blurry to tell as opposed to being a clean picture that simply doesn’t have the objects of interest? It isn’t an easy task to identify which pictures are “clean enough” for valid analysis. The answer can also vary based on the sophistication of the model being used.
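One common heuristic for the "too blurry to tell" question in the second bullet is to measure how much edge detail an image contains, for example with the variance of a Laplacian filter. The sketch below is only an illustration under stated assumptions: the image is a plain grid of grayscale values, the function names are hypothetical, and the threshold of 100.0 is an arbitrary placeholder that would need tuning for each model and data set.

```python
from statistics import pvariance

def laplacian_variance(gray):
    """Variance of a 4-neighbour Laplacian over the interior pixels.

    `gray` is a list of rows of grayscale values. Low variance means
    little edge detail, which often (not always) indicates blur.
    """
    h, w = len(gray), len(gray[0])
    if h < 3 or w < 3:
        raise ValueError("need at least a 3x3 image")
    responses = []
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            lap = (gray[y - 1][x] + gray[y + 1][x]
                   + gray[y][x - 1] + gray[y][x + 1]
                   - 4 * gray[y][x])
            responses.append(lap)
    return pvariance(responses)

def is_too_blurry(gray, threshold=100.0):
    # The threshold is an assumed placeholder, not a recommended value.
    return laplacian_variance(gray) < threshold

# Synthetic examples: a checkerboard has strong edges, a flat image has none.
sharp = [[255 if (x + y) % 2 == 0 else 0 for x in range(8)] for y in range(8)]
flat = [[128] * 8 for _ in range(8)]
```

A score like this cannot distinguish a blurry photo from a sharp photo of a featureless scene, which is part of why the "clean enough" question remains hard.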
The point here is that standard data quality methods such as outlier detection, missing value imputation and invalid value correction simply don’t apply to images, text and audio data. These data types have unique characteristics and unique usages compared to traditional structured data and, therefore, require some serious attention in terms of how to assess and enforce data quality.
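For contrast, here is roughly what those standard checks look like on structured data. This is a minimal sketch with made-up records: the field names, the 0 to 120 age range and the positive-price rule are all assumptions chosen for illustration.

```python
from statistics import median

def flag_invalid(records):
    """Return (row index, field, reason) for values that break simple rules."""
    issues = []
    for i, rec in enumerate(records):
        age = rec.get("age")
        price = rec.get("price")
        if age is None:
            issues.append((i, "age", "missing"))
        elif not 0 <= age <= 120:          # assumed plausible range
            issues.append((i, "age", "out of range"))
        if price is None:
            issues.append((i, "price", "missing"))
        elif price <= 0:                   # assumed business rule
            issues.append((i, "price", "non-positive"))
    return issues

def impute_median(records, field):
    """Fill missing values of `field` with the median of the known values."""
    known = [r[field] for r in records if r.get(field) is not None]
    fill = median(known)
    return [dict(r, **{field: fill}) if r.get(field) is None else dict(r)
            for r in records]

records = [
    {"age": 34, "price": 19.99},
    {"age": 212, "price": 24.50},   # impossible age
    {"age": None, "price": 0.0},    # missing age, suspicious price
    {"age": 51, "price": 21.00},
]
issues = flag_invalid(records)
cleaned = impute_median(records, "age")
```

Rules like these are easy to write because a number has an obvious valid range. There is no comparable one-line rule for "this photo was tagged incorrectly."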
Addressing AI’s Data Quality Challenges
The analysis of many new types of data such as images is still new enough that nobody has it all figured out, so you’ll need to do some research and experimentation. The first step is to task your data quality team with researching what others have published with respect to data quality and AI. There are approaches to data quality described in academic journals, as well as in industry publications, to explore, learn from and then implement.
The next step is to implement the best data quality procedures you can today; however, plan to update those capabilities as advances in addressing data quality for AI are made. Consider making use of a services-based approach to make the incorporation of a new data quality check simple. For example, have a current process simply pass each new image to an additional image screening routine once it becomes available. With this approach, you can build a foundation for AI data quality one piece at a time.
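One way to picture the services-based approach is a small registry of named checks that every image passes through, so a new check can be plugged in without touching the calling process. The sketch below is hypothetical throughout: the class name, the metadata fields (`width`, `height`, `blur_score`) and the thresholds are assumptions for illustration, not a reference design.

```python
from typing import Callable, Dict, List

# A check receives image metadata and returns True when the image passes.
Check = Callable[[dict], bool]

class ImageScreeningPipeline:
    """Holds named checks so new ones can be added without changing callers."""

    def __init__(self) -> None:
        self._checks: Dict[str, Check] = {}

    def register(self, name: str, check: Check) -> None:
        self._checks[name] = check

    def screen(self, image: dict) -> List[str]:
        """Return the names of the checks that this image fails."""
        return [name for name, check in self._checks.items() if not check(image)]

pipeline = ImageScreeningPipeline()
pipeline.register(
    "min_resolution",
    lambda img: img["width"] >= 224 and img["height"] >= 224,
)

image = {"width": 640, "height": 480, "blur_score": 12.0}
before = pipeline.screen(image)  # only the resolution check exists so far

# Later, a sharpness check becomes available and is plugged in.
pipeline.register("sharpness", lambda img: img["blur_score"] >= 100.0)
after = pipeline.screen(image)
```

The calling process only ever calls `screen`, so each new routine extends the foundation without a rewrite.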
On an ongoing basis, have your AI team spend extra time monitoring the performance of the AI models they’ve deployed to look for patterns related to misclassifications and errors. Once such patterns are identified, the team can create new data quality checks to address them. For example, if images with a blue background turn out to have a much higher error rate, add a check that flags those images and work to tune the algorithms to handle blue backgrounds better.
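Spotting a pattern like the blue background one usually comes down to comparing error rates across slices of the scored data. Here is a minimal sketch, assuming each logged prediction carries a hypothetical `background` attribute alongside the predicted and actual labels; the sample log is invented for illustration.

```python
from collections import defaultdict

def error_rate_by_group(predictions, key):
    """Share of misclassified records for each value of `key`."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for p in predictions:
        group = p[key]
        totals[group] += 1
        if p["predicted"] != p["actual"]:
            errors[group] += 1
    return {group: errors[group] / totals[group] for group in totals}

# Invented scoring log: three of four blue-background images are wrong.
log = [
    {"background": "blue", "predicted": "cat", "actual": "hot dog"},
    {"background": "blue", "predicted": "cat", "actual": "cat"},
    {"background": "blue", "predicted": "hot dog", "actual": "cat"},
    {"background": "blue", "predicted": "cat", "actual": "hot dog"},
    {"background": "white", "predicted": "cat", "actual": "cat"},
    {"background": "white", "predicted": "hot dog", "actual": "hot dog"},
    {"background": "white", "predicted": "cat", "actual": "hot dog"},
    {"background": "white", "predicted": "cat", "actual": "cat"},
]
rates = error_rate_by_group(log, "background")
```

With real volumes, the team would also want to confirm that a gap between groups is larger than chance before acting on it.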
Just as with other data types and methods, data quality will be an ever-present concern for AI processes and the data that feeds them. However, that is no reason to hold off pursuing AI today. Recognize the challenge that data quality will pose, take the actions described here, and anticipate that whatever you implement will only get better over time as your data quality processes mature.
Forbes Communications Council is an invitation-only community for executives in successful public relations, media strategy, creative and advertising agencies.
Original post: https://www.forbes.com/sites/forbescommunicationscouncil/2022/11/03/data-quality-is-also-an-ai-problem/?sh=659cfaaf261b&s=09