How To Reduce Bias in Machine Learning

This is the 2nd part of a series on ML Bias by Serhii Pospielov, AI practice lead, Exadel. While part one focused on what bias in machine learning is and where we can spot it, this article discusses reducing bias in ML and reinforcing ethics to battle bias in artificial intelligence systems. 


Researchers and engineers have already applied several positive practices to reduce ML bias. We will go through each step in the machine learning project pipeline and discuss how to reduce machine learning bias at each stage.

Bias in Data Collection

Bias in data collection arises when information is obtained incorrectly from various sources, shaped by prejudices and biased assumptions. This can happen when we fail to gather the most relevant features in the proper context for our use case, or when we combine data sources poorly. It can also occur when the data points we collect represent only specific groups or trends rather than the entire population. Amazon’s hiring process is a well-known example of AI error rooted in data collection: when the company examined its system, it found that the model was gender-biased because men are more heavily represented in Amazon’s specialized departments.

How To Avoid Bias when Collecting Data

When collecting data, it’s essential to have the expertise to extract the most meaningful variables possible. If you are collecting data for an ML project, assign a subject matter expert to the ML team to help capture the key features and their characteristics.


Bias with Pre-Processing

Bias in pre-processing occurs when you don’t fully understand the raw data and lack the expertise to interpret certain variables correctly.

How To Avoid Pre-Processing Bias

To mitigate ML bias when filling in missing data, choose an appropriate imputation method and add the imputed values. Then review the dataset to decide whether the imputed values plausibly reflect the actual observed values; if they don’t, try a different imputation approach to reduce bias in the model’s predictions. However, regardless of the method chosen, validating a model on offline training/testing data sets cannot capture the reality of an online environment. Therefore, you also need to monitor model performance in production and compare it across domains to detect regressions or deviations as they appear.
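As a minimal sketch of the imputation-and-review step, using only the Python standard library (the age data and the choice of median imputation are illustrative assumptions, not taken from this article):

```python
import statistics

def impute_median(values):
    """Replace missing entries (None) with the median of the observed values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]

ages = [34, None, 29, 41, None, 38]  # hypothetical feature with gaps
filled = impute_median(ages)

# Review step: every imputed value should sit inside the observed range;
# if it doesn't, reconsider the imputation method.
lo = min(v for v in ages if v is not None)
hi = max(v for v in ages if v is not None)
assert all(lo <= v <= hi for v in filled)
```

The same review loop applies to any other imputation strategy (mean, mode, model-based): impute, inspect against the observed distribution, and switch methods if the filled values look implausible.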

Using ML observability tools to gain deeper insight before discarding outliers is vital: these platforms can provide a better understanding of the nature of outliers and their relevance.
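Before reaching for a full observability platform, even a simple z-score check can surface outliers for manual review rather than silent deletion. This is an illustrative stdlib sketch with made-up sensor readings, not a specific tool’s API:

```python
import statistics

def flag_outliers(xs, z=2.0):
    """Return points more than z population standard deviations from the
    mean, flagged for manual review rather than automatic deletion."""
    mean = statistics.fmean(xs)
    std = statistics.pstdev(xs)
    return [x for x in xs if abs(x - mean) > z * std]

readings = [10, 11, 9, 10, 12, 100]  # hypothetical data with one suspect point
suspects = flag_outliers(readings)   # inspect these before dropping anything
```

Note that a single extreme value inflates the standard deviation itself (masking), which is one reason a human or a dedicated tool should look at flagged points in context instead of trusting a fixed threshold.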

Feature Engineering Bias

Feature engineering bias occurs when the way a machine learning model treats an attribute, or a set of attributes, has a deleterious effect on its results or predictions. These attributes may include social status, gender, or ethnic characteristics.

How To Avoid Feature Engineering Bias

The most crucial step is to scale the features so that the ranges of the independent variables are normalized. A raw data set usually includes features with different magnitudes, units, and ranges, and measuring the same kinds of characteristics on inconsistent scales invites inconsistency and bias. Feature scaling brings these features onto the same scale so the model can weigh them fairly.
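A minimal sketch of z-score standardization in plain Python (the income and tenure features are invented for illustration):

```python
import statistics

def standardize(xs):
    """Z-score scaling: shift to zero mean and unit (population) variance."""
    mean = statistics.fmean(xs)
    std = statistics.pstdev(xs)
    return [(x - mean) / std for x in xs]

income = standardize([30_000, 45_000, 60_000, 75_000])  # large magnitude
tenure = standardize([1, 3, 5, 7])                      # small magnitude

# After scaling, both features vary over the same range, so neither
# dominates distance-based or gradient-based learners by sheer magnitude.
```

In practice you would fit the scaling parameters on the training split only and reuse them on the test split, to avoid leaking test-set statistics into training.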

To mitigate feature bias, you must also identify the factors that most strongly bias the model’s results when it is applied to the entire population, such as gender, race, and regional preferences. You can then address these sources of ML bias, such as gender bias, category bias, and racial bias, directly in the data set.
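One way to check whether such a factor is skewing outcomes is to compare positive-outcome rates across groups. A rough stdlib sketch with made-up hiring records (the field names and data are illustrative assumptions):

```python
from collections import defaultdict

def selection_rates(rows, group_field, outcome_field):
    """Positive-outcome rate per group; a large gap between groups
    flags a potential bias worth investigating."""
    totals = defaultdict(lambda: [0, 0])  # group -> [positives, count]
    for row in rows:
        bucket = totals[row[group_field]]
        bucket[0] += row[outcome_field]
        bucket[1] += 1
    return {g: pos / n for g, (pos, n) in totals.items()}

applicants = [
    {"gender": "m", "hired": 1},
    {"gender": "m", "hired": 1},
    {"gender": "f", "hired": 1},
    {"gender": "f", "hired": 0},
]
rates = selection_rates(applicants, "gender", "hired")
```

A gap like this (1.0 vs 0.5) doesn’t prove bias on its own, but it tells you exactly where to look before the model is trained on these labels.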


Data Selection Bias


Data selection bias happens when the training data isn’t large or representative enough, so it misrepresents the actual population. For example, if you split a dataset into training and testing data such that most examples come from one part of the data distribution while another part is missing, the resulting biased model will only predict well for the kinds of feature sets present in its training set.


How To Avoid Data Selection Bias 

Random sampling in data selection can be a good fit if you need to mitigate such ML biases. Simple random sampling is one of the most successful methods researchers use to minimize sampling bias. It ensures that everyone in the population has an equal chance of being selected for the training data set. Another idea is to use stratified random sampling. It allows you to identify a sample population that best represents the overall population of interest and ensures that every subgroup of interest is represented.
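A minimal stdlib sketch of stratified sampling (the group labels and sampling fraction are illustrative; a real pipeline would more likely use a library routine such as scikit-learn’s `train_test_split` with its `stratify` parameter):

```python
import random
from collections import defaultdict

def stratified_sample(rows, group_of, frac, seed=0):
    """Draw frac of each subgroup, so every stratum keeps its
    population share in the sample."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[group_of(row)].append(row)
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * frac))
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical population: 80% group A, 20% group B
population = [("A", i) for i in range(80)] + [("B", i) for i in range(20)]
train = stratified_sample(population, group_of=lambda r: r[0], frac=0.5)
# Subgroup B keeps its 20% share instead of shrinking or vanishing by chance.
```

Simple random sampling skips the grouping step and draws from the whole population at once; stratification adds the guarantee that each subgroup of interest is represented in proportion.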


Model Training Bias

Model training bias reflects the inconsistency between the actual results and the trained model’s results, and it often comes from a mismatch between model and data set. Some models are suited to large data sets because they need many data points; trained on a small data set, such a model achieves high accuracy on the training data but doesn’t deliver good results on the test data. Regression and tree-based models, by contrast, tend to be suitable for small data sets.

How To Avoid Bias In Model Training

When selecting the appropriate model for a data set, consider essential aspects such as the data type, the problem, the desired outcome, and the amount of data available. Keep the real aim you want to accomplish in view: more constrained models are better when you need to interpret the results, because they are easier to understand, and you can immediately see how a single predictor is associated with the response.

Choose the machine learning method that best fits your data set when creating a model. One practical approach to model selection is to shortlist several candidate models (say, four) and then determine the best one with the help of cross-validation. Next, train the final model on the full dataset using the selected method and fine-tune its parameters.
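To keep the cross-validation step concrete, here is a toy stdlib sketch: the two "models" are trivial constant predictors (predict the training mean vs. the training median), and the data are made up, but the selection loop is the same shape you would use with real models:

```python
import statistics

def kfold(n, k):
    """Yield (train_indices, val_indices) for k contiguous folds."""
    size = n // k
    idx = list(range(n))
    for i in range(k):
        val = idx[i * size:(i + 1) * size]
        train = idx[:i * size] + idx[(i + 1) * size:]
        yield train, val

def cv_mse(y, fit, k=4):
    """Mean squared error of a constant predictor, averaged over k folds."""
    errors = []
    for train, val in kfold(len(y), k):
        pred = fit([y[i] for i in train])  # "train" the constant model
        errors.append(statistics.fmean((y[i] - pred) ** 2 for i in val))
    return statistics.fmean(errors)

y = [2, 4, 4, 4, 5, 5, 7, 9]
candidates = {"mean": statistics.fmean, "median": statistics.median}
best = min(candidates, key=lambda name: cv_mse(y, candidates[name]))
```

With real estimators, `fit` would return a fitted model and the error would come from its predictions on the held-out fold, but the comparison logic is identical.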


Model Validation Bias

It’s hard to judge the quality of a model by measuring its performance on the training data alone. Sensitivity analysis on training data is often biased, so it’s better to evaluate the model’s performance on held-out test data. A model fitted to the training dataset will also fit the few data points that show incorrect behavior. As a result, the model’s accuracy may be 90% while its sensitivity (recall) on the dataset is only 50%. Reading only the headline accuracy, you may think the model provides reliable results, but this assumption is wrong.


How To Avoid Bias In Model Evaluation 

Start by evaluating your model’s performance on test data to exclude bias from the training environment. When doing so, remember that the performance indicator you established earlier depends on the use case: in some circumstances, the model’s sensitivity matters more than its accuracy. For binary classification you can derive metrics from the confusion matrix; for regression models, use distance measures such as Euclidean distance and root-mean-square error. Then adjust the performance metric to get the correct model score.
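The 90%-accuracy / 50%-recall trap described above is easy to reproduce from confusion-matrix counts. A small stdlib sketch with made-up labels:

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary 0/1 labels."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, tn, fp, fn

# Imbalanced test set: 2 positives, 18 negatives
y_true = [1, 1] + [0] * 18
y_pred = [1, 0] + [1] + [0] * 17  # catches 1 of 2 positives, 1 false alarm

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)  # looks healthy
recall = tp / (tp + fn)             # half the positives are missed
```

On imbalanced data like this, accuracy is dominated by the majority class, which is exactly why the performance indicator has to be chosen from the use case rather than defaulted.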

Summarizing accurate statistics can be helpful if you want a healthy model, and it allows you to identify regressions in model predictions immediately. However, if you’re trying to identify ML bias, summary statistics can mask areas where your models may not be learning as you intended.


Mindful Leaps Forward with ML

Machine learning systems enjoy great popularity and promise everything from predicting the weather to detecting diseases. Nevertheless, the field of ML has its challenges, and the presence of bias is one that needs to be carefully addressed, because biases and the resulting inaccuracies in the data can cause real harm. Therefore, it’s crucial to understand biases, how to test for them, and how to prevent or remove them. Applying appropriate modeling rules can reduce or eliminate ML bias, and everybody working with ML can help create a more ethical industry.


