Poor data quality is hurting artificial intelligence (AI) and machine learning (ML) initiatives. This problem affects companies of every size, from small businesses and startups to giants like Google. Unpacking data quality issues often reveals a very human cause.
More than ever, companies are data-rich, but turning all of that data into value has proven to be challenging. The automation that AI and ML provide has been widely seen as a solution to dealing with the complex nature of real-world data, and companies have rushed to take advantage of it to supercharge their businesses. That rush, however, has led to an epidemic of sloppy upstream data analysis.
Once an automation pipeline is built, its algorithms do most of the work with little to no update to the data collection process. However, creating those pipelines isn’t a one-and-done task. The underlying data must be explored and analyzed over time to spot shifting patterns that erode the performance of even the most sophisticated pipelines.
The good news is that data teams can curtail the risk of erosion, but it takes some serious effort. To maintain effective automation pipelines, exploratory data analysis (EDA) must be regularly conducted to ensure that nothing goes wrong.
What is exploratory data analysis?
EDA is one of the first steps toward successful AI and ML. Before you even start thinking about algorithms, you need to understand the data. What happens in this phase will determine the course of the automation that takes place downstream. When done correctly, EDA will help you identify unwanted patterns and noise in the data and enable you to choose the right algorithms to leverage.
In the EDA phase, you need to be actively inquiring about the data to ensure it’s going to behave as expected. As a start, below are 10 important questions to ask for a thorough analysis:
- Are there enough data points?
- Are the measures of center and spread similar to what was expected?
- How many of the data points are good and actually usable for analysis?
- Are there any missing values? Are bad values a significant portion of the data?
- What is the empirical distribution of the data? Is the data normally distributed?
- Are there distinctive clusters or groups of values?
- Are there outliers? How should the outliers be treated?
- Are there any correlations between the dimensions?
- Is any data transformation needed to reformat the data for downstream analysis and interpretation?
- If the data is high-dimensional, can this dimensionality be reduced without too much information loss? Are some dimensions mostly noise?
These questions will often lead to follow-up questions, and those to still more. Don’t think of this as a checklist but as a jumping-off point. At the end of this process, you will be armed with a better understanding of the data’s patterns. You can then process the data correctly and choose the most appropriate algorithms for your problem.
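Several of the questions above can be answered with just a few lines of code. Below is a minimal sketch in plain Python; the sample values are invented for illustration. It checks the missing-value fraction, compares the mean against the median, and screens for outliers with a robust statistic:

```python
import statistics

# Hypothetical sample: daily transaction amounts (illustrative values only)
amounts = [12.5, 14.0, 13.2, None, 15.1, 14.8, 250.0, 13.9, 14.4, None]

# How many data points are actually usable? What fraction is missing?
usable = [x for x in amounts if x is not None]
missing_fraction = 1 - len(usable) / len(amounts)

# Measures of center: a large gap between mean and median hints at skew or outliers
mean = statistics.mean(usable)
median = statistics.median(usable)

# A robust outlier screen using the median absolute deviation (MAD);
# the plain standard deviation is inflated by the very outliers we seek
mad = statistics.median(abs(x - median) for x in usable)
outliers = [x for x in usable if abs(x - median) > 5 * mad]

print(f"missing: {missing_fraction:.0%}, mean: {mean:.2f}, "
      f"median: {median:.2f}, outliers: {outliers}")
```

The 5×MAD cutoff here is a judgment call, not a standard; the point is that each EDA question maps to a concrete, repeatable check you can rerun as the data shifts.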
The underlying data is constantly changing, which means that a significant amount of time must be spent on EDA to make sure that the input features to your algorithms are consistent. For example, Airbnb found that nearly 70% of the time a data scientist spends on developing models is allocated toward data collection and feature engineering, which requires extensive data analysis to ascertain the structures and patterns. In short, if a company does not invest the time to understand its data, its AI and ML initiatives can easily spin out of control.
Let’s look at an example of how companies have used data exploration effectively to develop successful data products.
The only constant is change
One of the most important aspects of digital services is cybersecurity and fraud detection, now a market valued at more than $30 billion and projected to reach more than $100 billion by the end of the decade. While there are tools such as Amazon Fraud Detector and PayPal’s Fraud Management Filters for general detection of online fraud, the only constant in fraud detection is that fraud patterns are always changing. Companies are constantly trying to stay prepared for new kinds of fraud while fraudsters are trying to innovate to get ahead.
Every new kind of fraud may have a novel data pattern. For instance, new user sign-ups and transactions may start arriving from an unexpected ZIP code at a rapid rate. While new users may come from anywhere, it would be suspicious if a ZIP code that was previously very quiet suddenly started screaming. The harder part of this calculus is distinguishing a fraudulent transaction from a normal one occurring in that same ZIP code.
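One simple way to operationalize that "quiet ZIP code suddenly screaming" signal is to compare each ZIP code's current volume against its historical baseline. The sketch below is purely illustrative: the counts, thresholds, and `flag_spikes` function are hypothetical, and a production system would use rolling windows and per-ZIP variance rather than fixed ratios.

```python
# Hypothetical daily sign-up counts per ZIP code: a historical baseline
# versus today's counts (all numbers invented for illustration)
baseline = {"10001": 40, "60614": 25, "94105": 30, "83702": 1}
today = {"10001": 44, "60614": 22, "94105": 35, "83702": 90}

def flag_spikes(baseline, today, ratio=10, min_count=20):
    """Flag ZIP codes whose volume jumped far above their baseline.

    A previously quiet ZIP that suddenly 'screams' is worth a closer
    look, though a spike alone does not prove fraud.
    """
    flagged = []
    for zip_code, count in today.items():
        expected = baseline.get(zip_code, 0)
        if count >= min_count and count > ratio * max(expected, 1):
            flagged.append(zip_code)
    return flagged

print(flag_spikes(baseline, today))  # only the previously quiet ZIP stands out
```

A flag like this is a prompt for investigation, not a verdict; the follow-up analysis of what else distinguishes those sign-ups is where EDA earns its keep.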
AI technologies can certainly be applied to build a fraud detection model here, but you as the data scientist must first tell the underlying algorithm which sign-ups and subsequent transactions are normal and which are fraudulent. This can only be done by searching through the data using statistical techniques. You dissect the customer base to ascertain what distinguishes normal customers from fraudsters. Next, you identify information that can help categorize these groups. Details may include sign-up information, transactions made, customer age, income, name and so on. You may also want to exclude information that would introduce significant noise into the downstream modeling steps; flagging a valid transaction as fraudulent could do more damage to your customer experience and product than the fraud itself.
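As a toy illustration of that feature-screening step, the snippet below compares a feature's average value across manually labeled classes; all records, field names, and values here are hypothetical. A feature whose per-class averages differ widely is a candidate model input, while one with nearly identical averages likely adds only noise:

```python
# Hypothetical labeled records produced by manual review of the data
records = [
    {"amount": 14.0, "account_age_days": 400, "label": "normal"},
    {"amount": 13.5, "account_age_days": 320, "label": "normal"},
    {"amount": 15.2, "account_age_days": 510, "label": "normal"},
    {"amount": 240.0, "account_age_days": 2, "label": "fraud"},
    {"amount": 310.0, "account_age_days": 1, "label": "fraud"},
]

def class_means(records, feature):
    """Average a feature per label; a wide gap between classes suggests
    the feature helps separate them, a narrow gap suggests noise."""
    sums, counts = {}, {}
    for r in records:
        sums[r["label"]] = sums.get(r["label"], 0) + r[feature]
        counts[r["label"]] = counts.get(r["label"], 0) + 1
    return {label: sums[label] / counts[label] for label in sums}

print(class_means(records, "amount"))
print(class_means(records, "account_age_days"))
```

In this invented sample, both transaction amount and account age separate the classes cleanly; on real data, this kind of per-class comparison is exactly the EDA work that decides which features reach the model.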
The frustrating (or fun, depending on who you ask) part is that this EDA process must be repeated for every product throughout its life cycle. New fraudulent activities mean new data patterns. Ultimately, companies must invest the time and energy in EDA so that they can come up with the best fraud detection features and maintain their AI and ML pipelines.
Understanding the data is the key to AI and ML success, not a vast repertoire of algorithms.
In fact, businesses can easily fail when they force their data to fit their AI and ML pipelines rather than the other way around.
Henry Li is Senior Data Scientist at Bigeye.