Python: Detecting Twitter Bots with Graphs and Machine Learning

The surge in Twitter user activity during the recent lockdown made it seem like a good place to start looking for a quarantine project to increase my competency with machine learning. Specifically, as misinformation took hold of the U.S.'s online population, coming up with new ways to identify bad actors seemed like an increasingly relevant task.

In this post, I’ll be demonstrating, with the help of some useful Python network graphing and machine learning packages, how to build a model for predicting whether Twitter users are humans or bots, using only a minimum viable graph representation of each user.

Outline

1. Preliminary Research

2. Data Collection

3. Data Conversion

4. Training the Classification Model

5. Closing Thoughts / Room for Improvement

Technical Notes

All programming, data collection, etc. was done in a Jupyter Notebook.
Libraries used:

tweepy
pandas
igraph
networkx
numpy
json
csv
ast
itemgetter (from operator)
re
Graph2Vec (from karateclub)
xgboost

Finally, four resources were key to this task; I'll reference and discuss each of them at the relevant points in this writeup.

Let’s get to it!

Preliminary Research

Bot detection as a goal is nothing new; a project like this would have been impossible without drawing on the prior, vital work referenced throughout this writeup. Still, there were a few topics within the problem space that I thought could be explored further.

The first was scale and reduction. I wouldn't describe my data science expertise at any level above "hobbyist", and as such, processing power and Twitter API access were both factors I had to keep in mind. I knew I wouldn't be able to replicate the accuracy of models built by larger, more established groups, so one of the things I set out to investigate was how scalable and accurate a classification model could be, given these limitations.

The second was the type of user data used in classification. I found several models that drew on a variety of different elements of users' profiles, from the text content of tweets to the length of usernames or the profile pictures used. However, I found only a handful of attempts at doing the same with features based on graphs of users' social networks. By chance, this graph-based method also ended up being the best way for me to collect just enough data on each user for later classification without coming up against Twitter's API limits.

Data Collection

First things first, when working with Twitter, you'll need developer API access. If you haven't already, you can apply for it through Twitter's developer portal, and Tweepy (the Twitter API wrapper I'll be using throughout this writeup) documents the authentication process in detail.

Once you’ve done so, you’ll need to create an instance of the API with your credentials, like so.
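A minimal version of that setup might look like the following, with placeholder strings standing in for your own keys and tokens from the developer dashboard:

import tweepy

# Placeholder credentials; substitute the keys and tokens from your
# Twitter developer dashboard.
consumer_key = "YOUR_CONSUMER_KEY"
consumer_secret = "YOUR_CONSUMER_SECRET"
access_token = "YOUR_ACCESS_TOKEN"
access_token_secret = "YOUR_ACCESS_TOKEN_SECRET"

# Authenticate and create the API instance used for every later request.
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth, wait_on_rate_limit=True)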

Inputting your credentials

In order to train a model, I would need a database of Twitter usernames and existing labels, as well as a way to quickly collect relevant data about each user.

The database I ended up settling on was IUNI's excellent Bot Repository, which contains thousands of labeled human and bot accounts in TSV and JSON format. The second part was a bit harder. At first, I tried to generate graph representations of each user's entire timeline, but due to API limits, this could take more than a day per user for more prolific accounts. The best format for a smaller graph ended up being the one produced by an existing tutorial's script for identifying Twitter influencers using Eigenvector Centrality.

I won't repeat the full explanation of how his script works, or what Eigencentrality is, because both are explained in his tutorial better than I could put them. From a high-level view, his script takes a Twitter user (or a keyword, though I didn't end up using that functionality) as input, and outputs a CSV with an edge list of users weighted by how much "influence" they have on the given user's interactions on Twitter. It additionally outputs an igraph object, which we'll be writing to a GML file that we'll use going forward as a unique representation of each user.

You’ll need the functionality of the TweetGrabber, RetweetParser, and TweetGraph classes from his tutorial. The next step is to create a TweetGrabber instance with your API keys, and perform a search on your selected Twitter user.
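If you'd like to see the shape of that step in one place, here is a simplified, self-contained stand-in for those classes, assuming the Tweepy api object from the earlier snippet. The edge-building rules and the build_user_graph helper here are my own simplification rather than the tutorial's exact code:

import tweepy
import igraph as ig

def build_user_graph(api, screen_name, gml_path, max_tweets=200):
    # Walk the user's recent timeline and tally who they retweet and mention.
    edge_weights = {}
    for tweet in tweepy.Cursor(api.user_timeline,
                               screen_name=screen_name,
                               tweet_mode="extended").items(max_tweets):
        targets = []
        if hasattr(tweet, "retweeted_status"):
            targets.append(tweet.retweeted_status.user.screen_name)
        targets += [m["screen_name"] for m in tweet.entities.get("user_mentions", [])]
        for target in targets:
            if target != screen_name:
                key = (screen_name, target)
                edge_weights[key] = edge_weights.get(key, 0) + 1

    # Build a weighted, directed igraph graph from the tallied edge list.
    edges = [(src, tgt, weight) for (src, tgt), weight in edge_weights.items()]
    graph = ig.Graph.TupleList(edges, directed=True, edge_attrs=["weight"])

    # Store each vertex's Eigencentrality in a 'size' attribute so it survives
    # the trip into the GML file (computed on the undirected view of the graph).
    graph.vs["size"] = graph.eigenvector_centrality(directed=False, weights="weight")

    # Write the graph to disk; this GML file is our representation of the user.
    # igraph's plot() can also be called on the graph here to view it.
    graph.write_gml(gml_path)
    return graph

# Example call; the screen name and output path are placeholders.
user_graph = build_user_graph(api, "some_screen_name", "some_screen_name.gml")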

Creating a GML file of a user’s relationship with their network

The important detail in that code is the 'size' attribute created for each vertex, which holds its Eigencentrality value. That means that when we write the created igraph object to a GML file, the file will contain all the information we need on the user, and the previously created CSV can be discarded. Additionally, if you'd like, you can plot and view the graph with igraph's plotting functions, which will likely look something like this:

The graph output for a Twitch streamer I follow.

To build the database I would be training my classification model on, I added each of the usernames and labels collected from the Bot Repository into a pandas DataFrame and iterated through it, running this script with each username as input. This part of the process was the most time-intensive, taking several hours, but the result, after dropping empty or deleted accounts from the frame, was just over 20,000 user graphs with "ground truth" labels for classification. Next step: formatting this data to train the model.
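In rough sketch form, reusing the hypothetical build_user_graph helper from above and assuming the labels are stored in a TSV with screen_name and label columns (the file name and column layout are placeholders), that loop looked something like this:

import pandas as pd
import tweepy

# Usernames and ground-truth labels from the Bot Repository; the file name
# and column names here are placeholders for however you've stored them.
labeled_users = pd.read_csv("bot_repository_labels.tsv", sep="\t",
                            names=["screen_name", "label"])

for row in labeled_users.itertuples():
    try:
        # build_user_graph() is the helper sketched in the previous section.
        build_user_graph(api, row.screen_name, f"{row.screen_name}.gml")
    except tweepy.TweepyException:
        # Suspended or deleted accounts get skipped here and their rows
        # dropped from the frame afterwards. (In Tweepy 3.x the exception
        # class is TweepError instead.)
        continue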

Data Conversion

But first, a brief refresher on what a model is (if you’re familiar, you can skip to ‘At this point in the process…’).

The goal of a machine learning model is to look at a bunch of information (features) about something, and then to use that information to try to predict a specific statement (or label) about that thing.

For example, this could be a model that takes a person’s daily diet and tries to predict the amount of plaque they would have on their teeth, or this could be a model that takes the kinds of stores a person shops at regularly and tries to predict their hair colour. A model like the first, where the information about the person is more strongly correlated to the characteristic being predicted (diet probably has more impact on dental health than shopping habits do on hair colour), is usually going to be more successful.

The way a model like this is created and "taught" to make these predictions more accurately is by exposing it to a large number of somethings that have their features and labels already given. Through "studying" these provided examples, the model ideally "learns" which features are the most correlated with one label or another.
E.g. if the database your model is “studying” contains information on a bunch of people who have the feature “eats marshmallows for breakfast”, and most of these people coincidentally tend to have higher amounts of plaque, your model is probably going to be able to predict that if an unlabeled person also eats marshmallows for breakfast, their teeth won’t be looking so hot.

For a better and more comprehensive explanation, I'd recommend seeking out a dedicated introduction to machine learning.

At this point in the process, we have a database of somethings (Twitter users), for each of which we also have information (their graph) and a yes/no statement (whether or not they're a bot). This brings us to our next step, which is a crucial one in the creation of a model: how to convert these graphs into input features. Providing a model with too much irrelevant information can make it take longer to "learn" from the input, or worse, make its predictions less accurate.

What’s the most efficient way to represent each user’s graph for our model, without losing any important information?

That's where the karateclub library comes in. Specifically, its Graph2Vec model, a whole-graph "embedding" technique that takes an arbitrarily-sized graph, such as the one in the image above, and embeds it as a lower-dimensional vector. For more information about graph embedding (including Graph2Vec specifically), there are a number of good explainers and papers worth reading.
Long story short, Graph2Vec converts graphs into denser representations that preserve properties like structure and information, which is exactly the kind of thing we want for our input features.

In order to do so, we’ll need to convert our graphs into a format that’s compatible with Graph2Vec. For me, the process looked like this:
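A sketch of that conversion for a single user follows; the GML path is a placeholder for one of the files written earlier, and fitting a fresh Graph2Vec per graph mirrors the per-user workflow described below (fitting one model on the full list of graphs also works):

import networkx as nx
from karateclub import Graph2Vec

# Read the GML file back in and massage it into what Graph2Vec expects:
# an undirected networkx graph whose nodes are consecutive integers.
graph = nx.read_gml("some_screen_name.gml", label="id")
graph = nx.convert_node_labels_to_integers(graph.to_undirected())

# Fit a 64-dimensional whole-graph embedding and pull the resulting vector out.
model = Graph2Vec(dimensions=64)
model.fit([graph])
embedding = model.get_embedding()  # shape (1, 64)
print(embedding)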

Creating a vector embedding of a user’s graph

The end result will look something like the following:

[[-0.04542452  0.228086    0.13908194 -0.05709897  0.05758724  0.4356743
   0.16271514  0.09336048  0.05702725 -0.2599525  -0.44161066  0.34562927
   0.3947958   0.30249864 -0.23051494  0.31273103 -0.26534733 -0.10631609
  -0.44468483 -0.17555945  0.07549448  0.38697574  0.2060106   0.08094891
  -0.30476692  0.08177203  0.35429433  0.2300599  -0.26465878  0.07840226
   0.14166194  0.0674125   0.0869598   0.16948421  0.1830279  -0.17096592
  -0.17521448  0.18930815  0.35843915 -0.19418521  0.10822983 -0.25496888
  -0.1363765  -0.2970226   0.33938506  0.09292185  0.02078495  0.27141875
  -0.43539774  0.23756032 -0.11258412  0.01081391  0.44175783 -0.19365656
  -0.04390689  0.09775431  0.03468767  0.06897729  0.2971188  -0.35383108
   0.2914173   0.45880902  0.22477058  0.12225034]]

Not pretty to human eyes, but combined with our labels, exactly what we'll need for creating the classification model. I repeated this process for each labeled user and stored the results in another pandas DataFrame, so that I now had a DataFrame of ~20,000 rows and 65 columns: 64 holding the components of each user's graph embedding, and the 65th holding the "ground truth" label of whether that user was a bot or a human. Now, on to the final step.
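Assuming embeddings is the stacked collection of those per-user vectors and labels is the matching list of ground-truth values (both names are my own), assembling that frame can look like this:

import numpy as np
import pandas as pd

# 'embeddings' is assumed to be a list of the (1, 64) vectors from the
# previous step, and 'labels' the matching 0/1 ground-truth values.
feature_columns = [f"v{i}" for i in range(64)]
training_frame = pd.DataFrame(np.vstack(embeddings), columns=feature_columns)
training_frame["is_bot"] = labels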

Training the Classification Model

Since our goal is classification (predicting whether each "something" should be placed in one of two categories, in this case bot or human), I opted to use xgboost's XGBClassifier model. XGBoost uses gradient boosting to optimize predictions for regression and classification problems, resulting, in my experience, in more accurate predictions than most other options out there.

From here, there are two different options:

If your goal is to train your own model to make predictions with and modify, and you have a database of user graph vectors and labels, you'll need to fit the classification model to your database. This was my process:
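A sketch of that fit, assuming the training_frame layout from the previous section (the column names, test split, and output file name are my own placeholders rather than anything prescribed by xgboost):

from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Separate the 64 embedding columns (features) from the 0/1 label column,
# holding out a slice of users to test the model on.
X = training_frame.drop(columns=["is_bot"])
y = training_frame["is_bot"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

# Fit the gradient-boosted classifier and check accuracy on the held-out users.
model = XGBClassifier()
model.fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))

# Save the trained model as JSON so it can be reloaded without retraining.
model.save_model("bot_classifier.json")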

Training your own classification model

If your goal is just to predict the human-ness or bot-ness of a single user you've graphed and vectorized, that's fine too. I've included a JSON file that you can load my model from in my GitHub, which is linked in my profile. Here's what that process would look like:
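In sketch form, with the JSON path standing in for wherever you've downloaded the model to, and embedding being the (1, 64) vector produced for the user you graphed:

from xgboost import XGBClassifier

# Load the saved classifier from its JSON file.
model = XGBClassifier()
model.load_model("bot_classifier.json")

# 'embedding' is the (1, 64) Graph2Vec vector for the user in question.
prediction = model.predict(embedding)
print(prediction)  # a single 0/1 label, matching how the training labels were encoded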

Loading my classification model

And that’s it! You should be looking at the predicted label for the account you started with.

Closing Thoughts / Room for Improvement

There are a number of things that could be improved about my model, and which I hope to revisit someday.

First and foremost is accuracy. While the classifier that IUNI has built from their Bot Repository demonstrates nearly 100% accuracy on its testing data set, mine demonstrates 79% accuracy. This makes sense, as I'm using far fewer features to predict each user's label, but I'm confident that there is a middle ground between my minimalist approach and IUNI's, and I would be interested in trying to combine graph-based, text-based, and profile-based classification methods.

Another, which ties into accuracy, is the structure of the graphs themselves. Igraph can compute the Eigenvector centrality for each node in a graph, but it also offers a number of other node-based measurements, such as closeness, betweenness, or an optimized combination of multiple measurements, and experimenting with these might produce more informative graph representations.

Finally, two things make it difficult to test and improve this model's accuracy at scale. The first is that, due to my limited understanding of vector embedding, it's difficult for me to identify which features lead to accurate or inaccurate labeling. The second is how well results on the testing data set would transfer to Twitter's ecosystem today. As bots are detected, the field evolves, and methods for detection will have to evolve as well. Toward that end, I've been skimming trending topics from Twitter throughout the quarantine for users to apply this model to, but I think that will have to wait for a later post.

Thank you for reading! Please let me know in the comments if you have any questions or feedback.

Original post: https://towardsdatascience.com/python-detecting-twitter-bots-with-graphs-and-machine-learning-41269205ab07
