How to Implement Artificial Intelligence Using scikit-learn

Data scientists use artificial intelligence (AI) for a vast array of powerful applications. It now runs control systems that reduce building energy consumption, recommends clothes to buy or shows to watch, helps improve farming practices and the amount of food we can grow, and some day may even drive our cars for us. Knowing how to use these tools will empower you to solve the next generation of society’s technical challenges.

Fortunately, getting started with artificial intelligence isn’t all that challenging for people who are already experienced with Python and data analysis. You can leverage the powerful scikit-learn package to do most of the hard work for you.

What is scikit-learn?

scikit-learn is a Python package designed to make machine learning and artificial intelligence algorithms easier to use. It includes algorithms for classification, regression, and clustering, including the popular random forest and gradient boosting algorithms. The package was designed to interface easily with the common scientific packages NumPy and SciPy. Although it was not specifically designed to work with pandas, it interfaces very well with pandas too.

scikit-learn also includes useful tools to facilitate use of the machine learning algorithms. Developing machine learning pipelines that accurately predict the behavior of a system requires splitting the data into training and testing sets, as well as scoring the algorithms to determine how well they function, and ensuring the models are neither overfit nor underfit. The scikit-learn interface includes tools to perform all of these tasks.

How do scikit-learn algorithms work?

Developing and testing scikit-learn algorithms can be performed in three general steps. They are:

  1. Train the model using an existing data set describing the phenomena you need the model to predict.
  2. Test the model on another existing dataset to ensure that it performs well.
  3. Use the model to predict phenomena as needed for your project.

The scikit-learn application programming interface (API) provides commands to perform each of these steps with a single function call. All scikit-learn algorithms use the same function calls for this process, so if you learn it for one you learn it for all.

The function call to train a scikit-learn algorithm is .fit(). To train each model you call the .fit function, and pass it two components of the training data set. The two components are the x data set, providing the data describing the features of the data set, and the y data, providing the data describing the targets of the system (Features and targets are machine learning terms essentially meaning x and y data). The algorithm then creates a mathematical model, as determined by the selected algorithm and the parameters of the model, that matches the provided training data as well as possible. It then stores the parameters in the model, allowing you to call the fit version of the model as needed for your project.

The function to test the fit of the model is .score(). To use this function you again call the function and pass an x data set representing the features and a corresponding y data set representing the targets. It’s important that you use a different data set, called the testing data set, than the one used when training the model. A model is quite likely to score very well when scored on the training data because it was mathematically forced to match that data set. The real test is how well the model performs on a different data set, which is the purpose of the testing data set. For regression models, such as the ones used in this article, .score() returns the r² value stating how well the model predicted the provided y data set using the provided x data set. For classification models it returns the mean accuracy instead.

You can predict the output of a system for a given set of inputs using scikit-learn’s .predict() function. It’s important that you only do this after fitting the model. Fitting is how you adjust the model to match the data set, so if you don’t fit it first then the model will not provide a valuable prediction. Once the model is fit, you can pass an x data set to the .predict() function and it will return a y data set as predicted by the model. In this way you can predict how a system will behave in the future.

These three functions form the core of the scikit-learn API, and go a long way to getting you applying artificial intelligence to your technical problems.
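As a minimal sketch of those three calls, the example below uses a small, noise-free synthetic data set following y = 3x + 2. The data and variable names are assumptions made purely for illustration and are not part of the HPWH example later in this article.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data following y = 3x + 2 (illustrative assumption)
X_train = np.arange(0, 10, dtype=float).reshape(-1, 1)
y_train = 3.0 * X_train.ravel() + 2.0
X_test = np.arange(10, 15, dtype=float).reshape(-1, 1)
y_test = 3.0 * X_test.ravel() + 2.0

model = LinearRegression()
model.fit(X_train, y_train)                # 1. train on the training data
test_score = model.score(X_test, y_test)   # 2. score on the testing data (r² for regressors)
predictions = model.predict(X_test)        # 3. predict targets for the given features

print(test_score)
print(predictions)
```

Because the synthetic data is perfectly linear, the model should reproduce the test targets almost exactly. Real data will rarely behave this nicely.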

How do I create training and testing data sets?

Creating separate training and testing data sets is a critical component of training artificial intelligence models. Without doing so we cannot create a model that matches the system we are trying to predict, nor can we verify the accuracy of its predictions. Fortunately, scikit-learn again provides a useful tool to facilitate this process. That tool is the train_test_split() function.

train_test_split() does exactly what it sounds like it does: it splits a provided data set into training and testing data sets. You pass your data to train_test_split(), and it returns the data split into the training and testing data sets you need to develop your model and to check that it correctly predicts the system you’re studying.

One thing to be careful of when using train_test_split() is that it is random in nature. This means that it will not return the same training and testing data sets when run on the same input data multiple times. This can be good if you want to test the variability of the model’s accuracy, but it can also be bad if you want to repeatedly use the same data sets on the model. To ensure you get the same result every time you can use the random_state parameter. The random_state setting forces the function to use the same randomization seed every time you run it, and so provides the same data set splits. When using random_state it’s customary to set it to 42, probably as a humorous nod to The Hitchhiker’s Guide to the Galaxy more than for any technical reason.
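The effect of random_state can be checked with a short sketch. The arrays below are made-up placeholders, and test_size = 0.2 is an assumption chosen for this illustration rather than a value used elsewhere in this article.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 10 rows with 2 features each, and 10 targets
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Splitting twice with the same seed should give identical splits
X_train_a, X_test_a, y_train_a, y_test_a = train_test_split(
    X, y, test_size=0.2, random_state=42)
X_train_b, X_test_b, y_train_b, y_test_b = train_test_split(
    X, y, test_size=0.2, random_state=42)

same_split = (np.array_equal(X_test_a, X_test_b)
              and np.array_equal(y_test_a, y_test_b))
print(same_split)
print(len(X_train_a), len(X_test_a))
```

Leaving random_state out, or changing its value, will generally produce a different split on each run.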

How does this work when put together?

All combined, these tools create a streamlined interface to create and use scikit-learn tools. Let’s talk through it using the example of scikit-learn’s LinearRegression model.

To implement this process we must first import the tools needed to do so. They include the scikit-learn model, the train_test_split() function, and pandas for the data analysis process. Note that scikit-learn is imported under the name sklearn. The functions are imported as follows:

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import pandas as pd

We can then read in a data set so it’s available for use training and testing the model. I’ve created a realistic data set demonstrating the performance of heat pump water heaters (HPWHs) as a function of their operating conditions specifically for helping people learn data science and engineering skills. Assuming you’ve downloaded that data set and saved it in the same folder as your script you can open it using the following line of code. If not, you can adjust these steps as needed to practice on any data set you like.

data = pd.read_csv('COP_HPWH_f_Tamb&Tavg.csv', index_col = 0)

The next step is to split the data set into the X and y data. To do this we create new data frames specifying the columns of the data set that represent the features and the targets. In the case of HPWHs, the features are tank temperature and ambient temperature while the target is electricity consumption. The dataset contains eight columns showing the water temperature at eight different depths in the water storage tank, each named ‘Tx (deg F)’ where x is a number representing the location of the measurement. It also contains a column showing the measured ambient temperature in the space surrounding the water heater, named ‘T_Amb (deg F)’. Finally, the data set contains a column storing electricity consumption data called ‘P_Elec (W)’. In this case it’s also important to filter our data set such that we only use data when electricity is being consumed. If we skip that step we’ll introduce non-linearity into a linear model, which is setting the model up to fail.

We can accomplish all those steps using the following code:

# Filter the data to only include points where power > 0
data = data[data['P_Elec (W)'] > 0]
# Identify X columns and create the X data set
X_columns = ['T_Amb (deg F)']
for i in range(1, 9):
    X_columns.append('T{} (deg F)'.format(i))
X = data[X_columns]
# Create the y data set
y = data['P_Elec (W)']

Now that we have X and y data sets we can split those X and y data sets into training and testing data sets. This can be done by calling scikit-learn’s train_test_split() function as follows.

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)

Now that we have training and testing data sets ready to go, we can create and fit the LinearRegression model to the data set. To do so we first create an instance of the model then call the .fit() function as follows.

model = LinearRegression()
model = model.fit(X_train, y_train)

Note that this implementation used the default parameters of the linear regression model. This may or may not yield a good fit to the data, and we may need to change the parameters to get a good fit. I will write a future article providing advanced ways to do that, but for now using the default parameters is enough to learn these concepts.

The next step is scoring the model on the testing data set to ensure that it fits the data set well. You can do this by calling .score() and passing the testing data.

score = model.score(X_test, y_test)

If the model scores well on the testing data set then chances are you have a well-trained model that is appropriate for the data set. If it doesn’t, then you need to consider gathering more data, adjusting the parameters of the model, or using a different model entirely.

If the model performs well, then you can declare the model ready to use and start predicting the behavior of the system. Since we don’t have an additional data set to predict right now, we can simply predict the output on the testing data set. To do that you call the .predict() function as follows.

predict = model.predict(X_test)

The predict variable will now hold the predicted output of the system when exposed to the inputs defined by X_test. You can then compare these outputs to the values in y_test directly, enabling you to investigate the model fit and prediction accuracy more carefully.
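One way to make that comparison is sketched below. It computes the residuals and the mean absolute error using scikit-learn’s mean_absolute_error() function. To keep the sketch runnable on its own it uses a synthetic stand-in data set rather than the HPWH data, so the feature range, the coefficients, and the noise level are all made-up assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical): a linear trend plus random noise
rng = np.random.default_rng(42)
X = rng.uniform(50, 120, size=(200, 1))
y = 4.0 * X.ravel() + 100.0 + rng.normal(0, 5, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
model = LinearRegression().fit(X_train, y_train)
predict = model.predict(X_test)

# Residuals: how far each prediction is from the corresponding measured value
residuals = y_test - predict
# Mean absolute error: the average size of those misses, in the units of y
mae = mean_absolute_error(y_test, predict)
print(mae)
```

The mean absolute error is expressed in the same units as the target, which can make it easier to interpret than r² alone.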

How well does this model perform?

Since we calculated the score of the model and saved it to the variable score we can quickly see how well the model predicts the electricity consumption of the HPWH. In this case, the score of the model is 0.58.

r² is a metric that typically falls between zero and one, although it can be negative on a testing data set if the model performs worse than simply predicting the mean of the data. Zero indicates that the model doesn’t at all explain the observed behavior of the system. One indicates that the model perfectly explains the behavior of the system. An r² value of 0.58 indicates that the model explains a bit over half of the observed behavior, which isn’t great.
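For reference, r² can be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below checks that calculation against scikit-learn’s r2_score() function using a few made-up numbers. It also shows that predictions worse than simply guessing the mean give a value below zero.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up measured values and predictions, for illustration only
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2_manual = 1.0 - ss_res / ss_tot
r2_sklearn = r2_score(y_true, y_pred)

# A deliberately poor prediction: always guessing 7.0
y_pred_bad = np.full_like(y_true, 7.0)
r2_bad = r2_score(y_true, y_pred_bad)

print(r2_manual, r2_sklearn, r2_bad)
```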

The three potential improvements that I mentioned above were:

  • Gathering more data,
  • Adjusting the parameters of the model,
  • Using a different model entirely.

We certainly could gather more data or adjust the parameters of the LinearRegression model, but the core issue here is likely that the relationship between heat pump power consumption and water temperature is non-linear. It’s hard for a linear model to predict something that’s not linear!

We can try the same method using a model that’s designed for non-linear systems and see if we get better results. One possible model is the Random Forest Regressor. We can try that by adding the following code to the end of the script.

from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor()
model.fit(X_train, y_train)
score = model.score(X_test, y_test)

This method yields a very high score of 0.9999, which is suspicious in the opposite direction. There’s a reasonable chance that this model is overfit to the data set and won’t actually yield realistic predictions in the future. Unfortunately, that isn’t something that we can truly determine given the available data set. If you were to use this model to start predicting the system you would need to monitor it to see how it performs as more data becomes available, and to keep training it. Plotting the predictions against the measured data would also provide insight into how the model behaves.
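Two rough checks for overfitting, neither of which is part of the script above, are comparing the score on the training data to the score on the testing data, and running cross-validation with scikit-learn’s cross_val_score() function. The sketch below tries both on a noisy synthetic data set. The data, the seeds, and the choice of five folds are assumptions made for illustration only.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic non-linear data with noise (hypothetical stand-in data set)
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(300, 1))
y = 10.0 * np.sin(X.ravel()) + rng.normal(0, 2, size=300)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = RandomForestRegressor(random_state=42)
model.fit(X_train, y_train)

# A large gap between these two scores hints at overfitting
train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)

# Cross-validation scores the model on several different train/test splits
cv_scores = cross_val_score(RandomForestRegressor(random_state=42), X, y, cv=5)

print(train_score, test_score)
print(cv_scores.mean(), cv_scores.std())
```

If the training score is much higher than the testing and cross-validation scores, the model has probably memorized noise in the training data rather than learned the underlying behavior.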

For this particular data set, I will say that I would trust this model. That’s because this data set doesn’t contain actual measured data. It’s an example data set that I created by implementing a regression equation to show how an HPWH behaves in these conditions. This means that the RandomForestRegressor probably matches the data so well because it identified the equation that I used to create the data set.

And with that, you should be in great shape to begin using scikit-learn to implement machine learning and artificial intelligence! If you remember that all scikit-learn algorithms use the fit(), score(), and predict() functions and that you can create your data sets using train_test_split() then you’re on your way to learning about the different algorithms and predicting actual system behavior.

