Data is unquestionably the central piece in any enterprise’s journey to AI. Bad data results not only in sub-optimal models but also in lost opportunities and far less value generated from the data. Data preparation has been called out as the most time-consuming step of the AI life cycle (80% of the time). One reason is that it is an iterative debugging process. The figure below (left) shows a data science lifecycle in theory, while the diagram on the right shows a more realistic version. The cycle is iterative because the challenges in the data are not known at the start of the lifecycle but are discovered along the way. In addition, the data passes through multiple personas, such as Data Steward, Data Engineer, Quality Analyst, AI Scientist, Governance Officer, and Business SME, before it can be used to build production-level models.
Two observations from the above discussion, the need for high-quality data for AI and the complex, intermingled data life cycle, call for a framework with the following two core characteristics:
- A rich algorithmic layer that includes new data quality assessment metrics, such as class imbalance and class overlap, along with corresponding remediation algorithms to improve the quality.
- A mechanism to communicate data quality, remediation, and other data-specific information to all stakeholders, enabling collaborative data improvement and giving a view into data quality much earlier in the process while avoiding duplicate work.
We are building algorithms that assess data along various dimensions and bring out the challenges in the data. To make the assessment actionable, we are also investing in data explainers that point to the parts of the data responsible for low quality, along with methods or guidance to fix the data. We focus on a variety of data modalities, including tabular, text, and time-stamped logs, to build highly scalable and robust algorithms packaged as a toolkit. This toolkit provides support for the first desired characteristic.
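To make the idea of a quality metric paired with an explainer concrete, here is a minimal sketch for class imbalance, one of the metrics named above. It is only an illustration and not the toolkit's implementation: the function names `class_imbalance_score` and `minority_classes`, the ratio-based score, and the `threshold` parameter are all assumptions made for this example.

```python
from collections import Counter


def class_imbalance_score(labels):
    """Hypothetical metric: smallest class count divided by largest class count.

    Returns a value in (0, 1]; 1.0 means all classes are equally frequent.
    """
    counts = Counter(labels)
    if len(counts) < 2:
        raise ValueError("need at least two classes to measure imbalance")
    return min(counts.values()) / max(counts.values())


def minority_classes(labels, threshold=0.5):
    """Hypothetical explainer: list the classes responsible for a low score.

    A class is reported when its count is below `threshold` times the
    count of the largest class.
    """
    counts = Counter(labels)
    largest = max(counts.values())
    return sorted(c for c, n in counts.items() if n < threshold * largest)
```

For a label column with 10 `spam` and 90 `ham` entries, the score is 10/90 and the explainer reports `spam` as the under-represented class, which is the part of the data a remediation step (for example, resampling) would target.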
To assist in collaboration, we propose a Data Assessment and Readiness Report, which is attached to each data set in the enterprise. The report will include:
- User-provided qualitative description of data
- Data Distribution Profiles
- Data Quality Profiles
- History of committed remediation
Readers coming from a data governance background will relate to this report as metadata. We also emphasize the need to maintain the history of data transformations to a) allow the user to see the evolution of the data quality and choose the version of interest and b) comply with audit and governance requirements.
This information is automatically appended to the report as the data is transformed through our algorithmic framework. One important thing to note is that the user can generate the report on demand for any version of the data, since we store the base data profile and the series of transformations.
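The versioning idea can be sketched as follows. This is a simplified illustration under stated assumptions, not the actual report format: the class `ReadinessReport`, its methods, and the toy `profile` function are hypothetical, and the sketch keeps the base rows themselves (rather than only a base profile) so that any version can be rebuilt by replaying the committed transformations.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

Row = Dict[str, Optional[float]]
Transform = Callable[[List[Row]], List[Row]]


def profile(rows: List[Row]) -> Dict[str, int]:
    """Toy quality profile: row count and number of missing (None) values."""
    missing = sum(1 for row in rows for value in row.values() if value is None)
    return {"rows": len(rows), "missing_values": missing}


def drop_incomplete(rows: List[Row]) -> List[Row]:
    """Example remediation: remove rows that contain a missing value."""
    return [row for row in rows if all(v is not None for v in row.values())]


@dataclass
class ReadinessReport:
    description: str                 # user-provided qualitative description
    base_rows: List[Row]             # version 0 of the data (an assumption here)
    history: List[Tuple[str, Transform]] = field(default_factory=list)

    def commit(self, name: str, transform: Transform) -> None:
        """Append a remediation step to the history."""
        self.history.append((name, transform))

    def report_for(self, version: int) -> dict:
        """Replay the first `version` transformations and profile the result."""
        if not 0 <= version <= len(self.history):
            raise ValueError("unknown version")
        rows = self.base_rows
        for _, transform in self.history[:version]:
            rows = transform(rows)
        return {
            "description": self.description,
            "version": version,
            "remediations": [name for name, _ in self.history[:version]],
            "profile": profile(rows),
        }
```

After committing `drop_incomplete` as a remediation, `report_for(0)` describes the original data and `report_for(1)` describes the data after that step, so the evolution of the quality profile across versions can be compared.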
To learn more about this exciting area of research, please join us for our three-part series on Data Assessment and Readiness techniques. The first session is on Friday, August 7, 2 pm-4 pm. Please register at the following link: https://www.meetup.com/IBMDevConnect-Bangalore/events/272247488/
We will update this post with the dates of the next parts. Later this month, we will present a shorter version of this series at the ACM KDD 2020 tutorial on “Overview and Importance of Data Quality for Machine Learning Tasks”.
Hope to see you at one of our sessions!