Machine learning for data cleaning and unification

The biggest problem data scientists face today is dirty data. In real-world data, inaccurate and incomplete values are the norm rather than the exception. The root of the problem lies at the source, where data is recorded without following standard schemas or in violation of integrity constraints. As a result, dirty data is delivered downstream to systems like data marts, where it is very difficult to clean and unify, which makes it unreliable for analytics.

Today data scientists often end up spending 60% of their time cleaning and unifying dirty data before they can apply any analytics or machine learning. Data cleaning is essentially the task of removing errors and anomalies from data, or replacing observed values with true values, to get more value from analytics. There are traditional types of data cleaning, such as imputing missing data and data transformations, and there are also more complex data unification problems, such as deduplication and repairing integrity constraint violations. All of these are interrelated, and it is important to understand what they are.

Data cleaning and unification problems

Schema mapping looks at multiple structured datasets and figures out whether they describe the same thing in the same way. In the example below, do “building #” and “building code” both represent the building number?

Schema Mapping
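As a rough illustration of the schema mapping question, the sketch below scores how similar two column names are after a light normalization step. The column names other than “building #” and “building code” are hypothetical, and name similarity alone is only a hint: a real mapping would also compare the values stored in each column.

```python
from difflib import SequenceMatcher

# Assumption: "#" in a column name stands for "number".
SYMBOLS = {"#": " number "}


def normalize(column: str) -> str:
    """Lowercase a column name, expand symbols, and collapse whitespace."""
    text = column.lower()
    for symbol, word in SYMBOLS.items():
        text = text.replace(symbol, word)
    return " ".join(text.split())


def similarity(a: str, b: str) -> float:
    """Return a 0..1 string similarity between two normalized column names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def best_match(column: str, candidates: list) -> tuple:
    """Return the (candidate, score) pair most similar to `column`."""
    scored = [(c, similarity(column, c)) for c in candidates]
    return max(scored, key=lambda pair: pair[1])


if __name__ == "__main__":
    # Hypothetical target schema.
    target_columns = ["building code", "manager name", "city"]
    print(best_match("building #", target_columns))
```

A high score only suggests that the two columns are candidates for the same attribute; a domain expert or a value-level comparison still has to confirm it.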

Record linkage deals with multiple mentions of the same real-world entity appearing across the data. The different formatting styles of each source lead to records that look different but in fact all refer to the same entity. In the example below, all four table records refer to the same medical lab.

Record Linkage & Deduplication
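To make the formatting problem concrete, here is a minimal sketch that reduces differently formatted names to a comparable set of tokens and measures their overlap. The abbreviation table and the sample names are assumptions made up for illustration, not data from the figure.

```python
import re

# Hypothetical abbreviation table; a real one would come from domain experts.
ABBREVIATIONS = {"med": "medical", "lab": "laboratory", "labs": "laboratory"}


def canonical_tokens(name: str) -> frozenset:
    """Lowercase, strip punctuation, and expand known abbreviations."""
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    return frozenset(ABBREVIATIONS.get(token, token) for token in tokens)


def jaccard(a: frozenset, b: frozenset) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


if __name__ == "__main__":
    mentions = ["Medical Lab", "Med. Laboratory", "MEDICAL LAB", "Lab, Medical"]
    reference = canonical_tokens(mentions[0])
    for mention in mentions:
        print(mention, jaccard(reference, canonical_tokens(mention)))
```

Token overlap ignores word order and punctuation, which is why all four spellings collapse to the same token set here. It would not catch misspellings, which need an edit-distance style comparison instead.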

Missing data refers to values that are missing from a dataset. Missing value imputation is the process of replacing missing data with substituted values. In practice, the problem is more complicated because missing data is often represented not by nulls but by garbage values, as in the example below.

Missing Data
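A minimal sketch of that two-step problem follows: first map garbage placeholders to a real missing marker, then impute. The list of placeholder strings is an assumption and is dataset-specific, and mode imputation is used only because it is the simplest strategy to show, not because it is the best one.

```python
from collections import Counter

# Assumption: these strings are used as "missing" placeholders in the data.
MISSING_MARKERS = {"", "n/a", "na", "null", "none", "-", "?", "unknown"}


def to_missing(value):
    """Return None for values that are really placeholders for missing data."""
    if value is None:
        return None
    text = str(value).strip().lower()
    return None if text in MISSING_MARKERS else value


def impute_mode(values):
    """Replace missing entries with the most frequent observed value."""
    cleaned = [to_missing(v) for v in values]
    observed = [v for v in cleaned if v is not None]
    if not observed:
        return cleaned  # nothing to learn from, leave the gaps in place
    fill = Counter(observed).most_common(1)[0][0]
    return [fill if v is None else v for v in cleaned]


if __name__ == "__main__":
    cities = ["Boston", "N/A", "Boston", "?", "Chicago"]
    print(impute_mode(cities))
```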

Integrity constraints ensure that data follows functional dependencies and business rules that dictate which values can legally exist together. Deducing constraints from data can be very difficult, especially since most data relations are non-obvious. In the example below, Jane Smith is the building manager for both the Medical Lab and the Management building, which breaks the business rule. For the rule to hold, either Jane Smith is not the manager of one of the two buildings, or the medical lab is actually a management building, or the management building is a medical lab.

Integrity Constraint
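One way to picture such a rule is as a functional dependency check. The sketch below assumes the business rule is "a manager is responsible for exactly one building type" (manager -> building_type); the column names and the John Doe row are hypothetical. It only detects violations and makes no attempt to decide which of the conflicting rows is wrong.

```python
from collections import defaultdict


def fd_violations(rows, lhs, rhs):
    """Find left-hand-side keys that map to more than one right-hand value.

    rows: list of dicts, lhs: list of column names, rhs: a column name.
    """
    seen = defaultdict(set)
    for row in rows:
        key = tuple(row[column] for column in lhs)
        seen[key].add(row[rhs])
    return {key: values for key, values in seen.items() if len(values) > 1}


if __name__ == "__main__":
    rows = [
        {"manager": "Jane Smith", "building_type": "Medical Lab"},
        {"manager": "Jane Smith", "building_type": "Management"},
        {"manager": "John Doe", "building_type": "Medical Lab"},
    ]
    print(fd_violations(rows, ["manager"], "building_type"))
```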

Above we’ve seen a few of the many data quality challenges. The problem is that most data scientists employ rule-based tools and ETL scripts that handle each data quality issue in isolation. In fact, most data, like the dataset in the figure below, has most if not all of these data quality problems, and they interact in complex ways. The problem is not only that tools cannot handle the interaction between data quality issues. These solutions also scale poorly to large datasets because they are computationally expensive and require multiple passes before enough corrections have been made.

Most datasets contain several data quality issues (source)

Machine learning for data cleaning and unification

Considering the issues with current solutions, the scientific community is advocating for machine learning solutions for data cleaning which consider all types of data quality issues in a holistic way and scale to large datasets.

Entity resolution is a good example of a data unification task where machine learning is useful. The three primary tasks involved in entity resolution are deduplication, record linkage, and canonicalization. In deduplication, we want to eliminate duplicate copies of repeated data. With record linkage, we aim to identify records that reference the same entity across different sources. Canonicalization is where we convert data with more than one representation into a standard form. These tasks can be framed as classification and clustering exercises: with enough training data, we can develop models that classify pairs of records as matches or non-matches and cluster the matched pairs into groups from which a golden record is chosen.

The figure below gives a visual representation of this process.

The deduplication process (source)
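The classify-then-cluster idea can be sketched in a few lines. In the sketch below a fixed similarity threshold stands in for the trained pairwise classifier, matched pairs are merged with a union-find structure, and the golden record is simply the most common spelling in each cluster. All three choices, and the sample records, are simplifying assumptions for illustration.

```python
from collections import Counter
from difflib import SequenceMatcher
from itertools import combinations


def is_match(a: str, b: str, threshold: float = 0.8) -> bool:
    """Stand-in for a trained classifier: match if strings are similar enough."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def cluster(records, match=is_match):
    """Group records so that every matched pair ends up in the same cluster."""
    parent = list(range(len(records)))

    def find(i):
        # Follow parent links to the cluster root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(records)), 2):
        if match(records[i], records[j]):
            parent[find(i)] = find(j)

    groups = {}
    for i, record in enumerate(records):
        groups.setdefault(find(i), []).append(record)
    return list(groups.values())


def golden_record(group):
    """Pick the most frequent spelling in a cluster as its canonical form."""
    return Counter(group).most_common(1)[0][0]


if __name__ == "__main__":
    records = ["Acme Medical Lab", "Acme Medical Lab", "ACME MEDICAL LAB",
               "Acme Medical Lab.", "Zenith Management"]
    for group in cluster(records):
        print(golden_record(group), "<-", group)
```

Note that this version still compares every pair of records, so it illustrates the logic rather than a scalable implementation.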

The Python dedupe library is an example of a scalable ML solution for performing deduplication and record linkage across disparate structured datasets. To work effectively, dedupe relies on domain experts to label pairs of records as matches or non-matches. Domain experts are important because they are good at recognizing which fields are most likely to uniquely identify a record, and they can judge what a canonical version of a record should look like.

Record repair is another use of ML in data cleaning, and an important component of unification projects. Repairing records is mainly about predicting the correct values of erroneous or missing attributes in a source data record. The key is to build a model that combines signals such as integrity constraints, external knowledge and quantitative statistics for probabilistic inference of missing or erroneous values.

The HoloClean system is one example of such an ML solution. The user provides a dataset to be cleaned along with some high-level, domain-specific information, such as reference datasets, available rules and constraints, and examples of similar entries within the database. The system then fixes errors ranging from conflicting and misspelled values to outliers and null entries. In the example below, the user can specify denial constraints (first-order logic rules that capture domain expertise), such as City, State, Address -> Zip, to identify that row 1 conflicts with the information in rows 2 and 3.

Repairing Integrity Constraint Violations (source)
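To show how a constraint like City, State, Address -> Zip can drive a repair, the sketch below groups rows by the left-hand columns and overwrites a conflicting Zip with the majority value in its group. This is a deliberately naive stand-in and not how HoloClean works: HoloClean combines several signals in a probabilistic model, whereas this uses a single constraint and a vote. The rows are hypothetical.

```python
from collections import Counter, defaultdict


def repair_by_majority(rows, lhs, rhs):
    """Return copies of rows where rhs is set to the majority value per lhs key.

    Ties are left untouched because a vote cannot decide them.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[column] for column in lhs)].append(row[rhs])

    repaired = []
    for row in rows:
        counts = Counter(groups[tuple(row[column] for column in lhs)])
        (top_value, top_count), *rest = counts.most_common(2)
        tie = bool(rest) and rest[0][1] == top_count
        value = row[rhs] if tie else top_value
        repaired.append({**row, rhs: value})
    return repaired


if __name__ == "__main__":
    key = ["city", "state", "address"]
    rows = [
        {"city": "Chicago", "state": "IL", "address": "3465 S Morgan St", "zip": "60609"},
        {"city": "Chicago", "state": "IL", "address": "3465 S Morgan St", "zip": "60608"},
        {"city": "Chicago", "state": "IL", "address": "3465 S Morgan St", "zip": "60608"},
    ]
    for row in repair_by_majority(rows, key, "zip"):
        print(row)
```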

Providing scalable ML solutions

Organizations using traditional approaches for data cleaning and unification have built many legacy operations, including ETL scripts with business rules and domain-knowledge documentation. Ignoring all this work is the main reason many new solutions fail to improve legacy operations. A big ML challenge is how to ingest information from legacy operations to guide the learning process for ML solutions. One option is to use the legacy operations to build training data, but the resulting data may be biased and noisy.

Since data cleaning and unification are usually performed at the source, the ML solutions need to be able to do inference and predictions on large-scale datasets. Taking the deduplication solution referred to earlier, classification essentially relies on computing similarities between pairs of records, which is an n-squared problem and therefore computationally heavy and slow. It therefore becomes important to engineer solutions that use techniques like sampling and hashing to reduce complexity.
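A common hashing-style trick is blocking: records are bucketed by a cheap key, and only records that share a bucket are compared. The sketch below uses the first three letters of the name as the key, which is an arbitrary choice for illustration, as are the sample names.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(records, key):
    """Yield index pairs of records that fall into the same block."""
    blocks = defaultdict(list)
    for index, record in enumerate(records):
        blocks[key(record)].append(index)
    for members in blocks.values():
        yield from combinations(members, 2)


def name_prefix(record: str) -> str:
    """Hypothetical blocking key: the first three letters, lowercased."""
    return record[:3].lower()


if __name__ == "__main__":
    names = ["Acme Lab", "acme laboratory", "Acme Labs",
             "Zenith Mgmt", "Zenith Management", "Orbit Clinic"]
    all_pairs = len(names) * (len(names) - 1) // 2
    blocked = list(candidate_pairs(names, name_prefix))
    print(f"{len(blocked)} candidate pairs instead of {all_pairs}")
```

The saving comes at a cost: two records that refer to the same entity but land in different blocks (for example because of a typo in the first letters) are never compared, so the blocking key has to be chosen with care.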

It is important for people to be part of the process for any ML solution to work. Roles like domain experts and IT experts are essential in transferring legacy operations into useful features and labels to generate training data for ML solutions, to verify results of ML predictions on data quality and to assess the impact of ML cleaning and unification at the source on analytics downstream.

In conclusion, data cleaning and unification at the source are essential for creating trustworthy analytics for organizations downstream. It is important to recognize that data quality problems cannot be solved properly in isolation, and machine learning solutions that offer holistic approaches to cleaning and unifying data may be the best solution. At the same time, to develop scalable ML pipelines that work at the organizational level, we must ensure that these solutions build upon legacy operations and bring humans into the loop.

Original post: https://towardsdatascience.com/machine-learning-for-data-cleaning-and-unification-b3213bbd18e
