The uptick in Twitter user activity during the recent lockdown made it seem like a good place to look for a quarantine project that would build my competency with machine learning. Specifically, as misinformation and baffling conspiracy theories took hold of the U.S.’s online population, coming up with new ways to identify bad actors seemed like an increasingly relevant task.
In this post, I’ll be demonstrating, with the help of some useful Python network graphing and machine learning packages, how to build a model for predicting whether Twitter users are humans or bots, using only a minimum viable graph representation of each user.
1. Preliminary Research
2. Data Collection
3. Data Conversion
4. Training the Classification Model
5. Closing Thoughts / Room for Improvement
All programming, data collection, etc. was done in a Jupyter Notebook, using the following packages: tweepy, pandas, igraph, networkx, numpy, json, csv, ast, itemgetter (from operator), re, Graph2Vec (from karateclub), and xgboost.
Finally, four resources were key to this task, which I will discuss later in this writeup:
- The Indiana University Network Science Institute’s Bot Repository,
- Jacob Moore’s tutorial on identifying Twitter influencers, using Eigenvector Centrality as a metric,
- Karate Club, an extension of NetworkX,
- and the XGBoost gradient boosting library.
Let’s get to it!
Bot detection as a goal is nothing new, to the extent that a project like this would have been impossible without drawing on the prior and vital work referenced above. Even so, there were a few topics within the problem space that I thought could be further explored.
The first was scale and reduction. I wouldn’t describe my data science expertise at any level above “hobbyist”, and as such, processing power and Twitter API access were both factors I had to keep in mind. I knew I wouldn’t be able to replicate the accuracy of models built by larger, more established groups, so instead I set out to investigate how scalable and accurate a classification model could be made given these limitations.
The second was the type of user data used in classification. I found several models that drew on a variety of different elements of users’ profiles, from the text content of tweets to the length of usernames or the profile pictures used. However, I found only a few attempts at doing the same with features based on graphs of users’ social networks. By chance, this graph-based method also ended up being the best way for me to collect just enough data on each user to use for later classification without coming up against Twitter’s API limits.
First things first, when working with Twitter, you’ll need developer API access. If you haven’t already, you can apply for it here, and Tweepy (the Twitter API wrapper I’ll be using throughout this writeup) has more information on the authentication process in its docs.
Once you’ve done so, you’ll need to create an instance of the API with your credentials, like so.
In order to train a model, I would need a database of Twitter usernames and existing labels, as well as a way to quickly collect relevant data about each user.
The database I ended up settling on was IUNI’s excellent Bot Repository, which contains thousands of labeled human and bot accounts in TSV and JSON format. The second part was a bit harder. At first, I tried to generate graph representations of each user’s entire timeline, but due to API limits this could take more than a day for some of the more prolific users. The best recipe for a smaller graph ended up coming from Jacob Moore’s tutorial for identifying Twitter influencers using Eigenvector Centrality.
I won’t repeat the full explanation of how his script works or what Eigencentrality is, since his tutorial covers both better than I could. From a high-level view, his script takes a Twitter user (or a keyword, though I didn’t end up using that functionality) as input, and outputs a CSV with an edge list of users weighted by how much “influence” they have on the given user’s interactions on Twitter. It additionally outputs an iGraph object, which we’ll be writing to a GML file that we’ll use going forward as a unique representation of each user.
You’ll need the functionality of the TweetGrabber, RetweetParser, and TweetGraph classes from his tutorial. The next step is to create a TweetGrabber instance with your API keys, and perform a search on your selected Twitter user.
The script then creates a ‘size’ attribute for each vertex that holds its Eigencentrality value, meaning that when we write the created iGraph object to a GML file, that file will contain all the information we need on the user, and the previously created CSV can be discarded. Additionally, if you’d like, you can uncomment the plotting lines to view the graph, which will likely look something like this:
To build the database I would be training my classification model on, I added each of the usernames and labels collected from the Bot Repository into a pandas DataFrame, and iterated through it, running this script with each of the usernames as input. This part of the process was the most time-intensive, taking several hours, but the result, after dropping empty or deleted accounts from the frame, was just over 20,000 user graphs with ‘ground truth’ labels for classification. Next step: formatting this data to train the model.
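In outline, that loop looked something like this. The two-row frame and the `graph_user` stub are placeholders for the real ~20,000-row DataFrame and the graphing script above:

```python
import pandas as pd

# Placeholder rows; the real frame held ~20,000 username/label pairs
# pulled from the Bot Repository's TSV/JSON files
users = pd.DataFrame({
    "username": ["example_user_1", "example_user_2"],
    "label": [0, 1],  # 0 = human, 1 = bot
})

def graph_user(username):
    """Stub for the search + TweetGraph + write_gml steps above."""
    return f"{username}.gml"

# Record the GML file produced for each account; empty or deleted
# accounts get dropped from the frame afterwards
users["gml_file"] = users["username"].apply(graph_user)
```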
But first, a brief refresher on what a model is (if you’re familiar, you can skip to ‘At this point in the process…’).
The goal of a machine learning model is to look at a bunch of information (features) about something, and then to use that information to try to predict a specific statement (or label) about that thing.
For example, this could be a model that takes a person’s daily diet and tries to predict the amount of plaque they would have on their teeth, or this could be a model that takes the kinds of stores a person shops at regularly and tries to predict their hair colour. A model like the first, where the information about the person is more strongly correlated to the characteristic being predicted (diet probably has more impact on dental health than shopping habits do on hair colour), is usually going to be more successful.
The way a model like this is created and “taught” to make these predictions more accurately is by exposing it to a large number of somethings whose features and labels are already given. By “studying” these provided examples, the model ideally “learns” which features are most correlated with one label or another.
E.g. if the database your model is “studying” contains information on a bunch of people who have the feature “eats marshmallows for breakfast”, and most of these people coincidentally tend to have higher amounts of plaque, your model is probably going to be able to predict that if an unlabeled person also eats marshmallows for breakfast, their teeth won’t be looking so hot.
For a better and more comprehensive explanation, I’d recommend this video.
At this point in the process, we have a database of somethings (Twitter users), for each of which we also have information (their graph) and a yes/no statement (whether or not they’re a bot). This brings us to the next step, a crucial one in the creation of a model: how to convert these graphs into input features. Providing a model with too much irrelevant information can make it take longer to “learn” from the input, or worse, make its predictions less accurate.
What’s the most efficient way to represent each user’s graph for our model, without losing any important information?
That’s where Karate Club comes in. Specifically, Graph2Vec, a whole-graph “embedding” library that takes an arbitrarily-sized graph such as the one in the above image, and embeds it as a lower-dimensional vector. For more information about graph embedding (including Graph2Vec specifically), I’d recommend this writeup, as well as this white paper.
Long story short, Graph2Vec converts graphs into denser representations that preserve their structural information, which is exactly the kind of thing we want for our input features.
In order to do so, we’ll need to convert our graphs into a format that’s compatible with Graph2Vec. For me, the process looked like this:
The end result will look something like the following:
[[-0.04542452 0.228086 0.13908194 -0.05709897 0.05758724 0.4356743 0.16271514 0.09336048 0.05702725 -0.2599525 -0.44161066 0.34562927 0.3947958 0.30249864 -0.23051494 0.31273103 -0.26534733 -0.10631609 -0.44468483 -0.17555945 0.07549448 0.38697574 0.2060106 0.08094891 -0.30476692 0.08177203 0.35429433 0.2300599 -0.26465878 0.07840226 0.14166194 0.0674125 0.0869598 0.16948421 0.1830279 -0.17096592 -0.17521448 0.18930815 0.35843915 -0.19418521 0.10822983 -0.25496888 -0.1363765 -0.2970226 0.33938506 0.09292185 0.02078495 0.27141875 -0.43539774 0.23756032 -0.11258412 0.01081391 0.44175783 -0.19365656 -0.04390689 0.09775431 0.03468767 0.06897729 0.2971188 -0.35383108 0.2914173 0.45880902 0.22477058 0.12225034]]
Not pretty to human eyes, but combined with our labels, exactly what we’ll need for creating the classification model. I repeated this process for each labeled user and stored the results in another pandas DataFrame, so that I now had a DataFrame of ~20,000 rows and 65 columns: 64 holding the components of each user’s graph vector, and the 65th holding the “ground truth” label of whether that user was a bot or a human. Now, on to the final step.
Training the Classification Model
Since our goal is classification (predicting whether each “something” should be placed in one of two categories, in this case a bot or a human), I opted to use XGBoost’s XGBClassifier model. XGBoost uses gradient boosting to optimize predictions for regression and classification problems, resulting in, in my experience, more accurate predictions than most other options out there.
From here, there are two different options:
If your goal is to train your own model to make predictions with and modify, and you have a database of user graph vectors and labels to do so with, you’ll need to fit the classification model to your database. This was my process:
If your goal is just to predict the human-ness or bot-ness of a single user you’ve graphed and vectorized, that’s fine too. I’ve included in my GitHub (linked in my profile) a JSON file that you can load my model from. Here’s what that process would look like:
And that’s it! You should be looking at the predicted label for the account you started with.
Closing Thoughts / Room for Improvement
There are a number of things that could be improved about my model, and which I hope to revisit someday.
First and foremost is accuracy. While the Botometer classification model that IUNI has built from their Bot Repository achieves nearly 100% accuracy on the testing data set, mine achieves 79%. This makes sense, as I’m using far fewer features to predict each user’s label, but I’m confident that there is a middle ground between my minimalist approach and IUNI’s, and I would be interested in trying to combine graph-based, text-based, and profile-based classification methods.
Another, which ties into accuracy, is the structure of the graphs themselves. Igraph can compute the Eigenvector Centrality for each node in a graph, but it also includes a number of other node-based measurements, such as closeness, betweenness, or an optimized combination of multiple measurements, any of which could replace or supplement Eigencentrality as the vertex attribute the embeddings are built from.
Finally, two things make it difficult to test and improve upon the accuracy of this model at scale. The first is that, due to my limited understanding of vector embedding, it’s difficult for me to identify which features lead to accurate or inaccurate labeling. The second is how well results on the testing data set would transfer to Twitter’s ecosystem today. As bots are detected, the field evolves, and methods for detection will have to evolve as well. Towards that end, I’ve been skimming trending topics from Twitter throughout the quarantine for users to apply this model to, but I think that will have to wait for a later post.
Thank you for reading! Please let me know in the comments if you have any questions or feedback.