If you take a step back, you will see that a major driver of this is news organizations that favor one side of the political spectrum, and that preference is reflected in their reporting. Certain facts may be omitted entirely, and opinionated sentences are blended in to drive home a particular narrative.
Hence, for my passion project at Metis’ Data Science bootcamp (which I undertook in Fall 2020), I decided to see how data science could help with this matter. I wanted to create “Unbiased News” so that readers could get more impartial information on current events.
Starting from www.allsides.com, I scraped all their webpages categorized under “stories”. AllSides is a brilliant initiative that takes a news event and collects articles written on it by a left leaning, a right leaning and a center leaning media outlet. They write a summary of the event and briefly mention what each of the three outlets emphasizes. An example of this can be viewed here. They also publish pre-established ratings of the current political bias of all major media outlets.
AllSides only provides a snippet of the original news article, accompanied by a link to the full article on the outlet’s own website. Hence, each media outlet must be scraped and parsed individually. Since there can be 100+ news agencies, I prioritized the 5 most frequently referenced outlets (3 Left Leaning: New York Times, Washington Post, and HuffPost; 2 Right Leaning: Fox News and Washington Times), which together covered 1000+ news stories. This helped focus the initial scope tremendously.
So, at this point I had a corpus of news events with 2 articles addressing each story — one from a left leaning media outlet and one from a right leaning outlet.
Defining an MVP
Before proceeding, I needed to concretely define what an end product would even look like, and specifically one that I could deliver within 3 weeks (of which I had already spent a week gathering the above data).
As the primary objective of this project, I wanted to take a left and a right leaning article (on the same news event) and find all the common points mentioned by both.
As an example, let’s take two articles, one from New York Times (Left Leaning) and the other from Fox News (Right Leaning), that address the story of “U.S. Supreme Court rejecting the Republican bid to have mail ballots tossed out in Pennsylvania” (in relation to the Nov 2020 U.S. elections).
Below are snapshots of a certain portion from each article.
In the above example, the highlighted sentences are essentially saying the same thing. I would like to find all such similar sentence pairs and produce a cogent summary (using just one of the two sentences in each pair) containing only what is commonly said by both sides.
In the land of Natural Language Processing (NLP), checking how similar two sentences are is known as measuring their ‘Semantic Textual Similarity’ (STS).
Solving the Core Problem — STS
WARNING: This is the only section of the article that is quite technical. If you’d like, read the part in bold below and skip the rest.
Getting sensible STS results was proving to be quite a challenge until I came across UKPLab’s Sentence Transformers library (aka S-Bert). It was the best tool I found at the time for measuring the STS between sentences. It made the most daunting part of this project seem almost like a walk in the park.
To quantitatively check the STS results, I used cosine similarity scores. Mathematically, cosine similarity ranges from -1 to 1, but for sentence embeddings like these the scores in practice mostly fall between 0 and 1, where 1 indicates highly similar and 0 means highly dissimilar.
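To make the metric concrete, here is a minimal pure-Python sketch of cosine similarity on two plain lists of numbers. The function name `cosine_similarity` and the toy vectors are purely illustrative; in the project the inputs would be sentence embeddings rather than hand-written lists.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Same direction (one vector is just a scaled copy of the other): score near 1.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
# Perpendicular vectors share nothing: score 0.
unrelated = cosine_similarity([1.0, 0.0], [0.0, 1.0])
# Opposite directions hit the mathematical lower bound of -1.
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])
```

Only the direction of the vectors matters here, not their length, which is why a vector and a scaled copy of it count as identical.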
Getting back to S-Bert, it comes with many transformer-based language models (LMs) specifically trained for comparing the similarity of entire sentences. Amongst these, the ‘roberta-large-nli-stsb-mean-tokens’ LM performed the best on industry benchmark tests at the time, so I used it for prototyping in this project. However, it is large and slow to run, making it less than ideal for live production usage.
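As a rough sketch of what scoring sentence pairs with S-Bert can look like (not the project's actual code), the snippet below assumes the `sentence-transformers` package is installed and uses its `SentenceTransformer.encode` and `util.cos_sim` calls. The helper names `load_sbert_scorer` and `best_matches` are hypothetical, and the scorer is passed in as a plain function so the matching logic can be tried out without downloading the model.

```python
from typing import Callable, List, Sequence, Tuple

# A scorer takes the left and right sentences and returns a matrix where
# matrix[i][j] is the similarity of left[i] and right[j].
Scorer = Callable[[Sequence[str], Sequence[str]], List[List[float]]]


def load_sbert_scorer(model_name: str = "roberta-large-nli-stsb-mean-tokens") -> Scorer:
    """Build a scorer backed by an S-Bert model (downloads the model on first use)."""
    # Imported lazily so the rest of this sketch works without the library.
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer(model_name)

    def score(left: Sequence[str], right: Sequence[str]) -> List[List[float]]:
        left_emb = model.encode(list(left), convert_to_tensor=True)
        right_emb = model.encode(list(right), convert_to_tensor=True)
        # cos_sim returns a len(left) x len(right) tensor of cosine scores.
        return util.cos_sim(left_emb, right_emb).tolist()

    return score


def best_matches(
    left: Sequence[str], right: Sequence[str], score: Scorer
) -> List[Tuple[str, str, float]]:
    """For each left sentence, return its highest-scoring right sentence."""
    matrix = score(left, right)
    matches = []
    for i, row in enumerate(matrix):
        if not row:  # no right sentences to compare against
            continue
        j = max(range(len(row)), key=row.__getitem__)
        matches.append((left[i], right[j], row[j]))
    return matches
```

With the library installed, `best_matches(nyt_sentences, fox_sentences, load_sbert_scorer())` would return one candidate pair per NYT sentence, where `nyt_sentences` and `fox_sentences` are placeholder lists of sentence strings.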
Now, I simply had to figure out heuristics to solve all the remaining problems and arrive at a practically useful end product.
Getting To The Finish Line
The second part of this endeavor posed a different challenge — it required a lot of text analysis & domain intuition to arrive at a final result which made sense.
Let’s continue with the above-mentioned news event on “U.S. Supreme Court rejecting the Republican bid to have mail ballots tossed out in Pennsylvania”.
The NYT article has 53 sentences and the Fox News article has 22. Together, these make 1166 possible sentence pairs (53 x 22), only a fraction of which are similar and relevant to my end vision.
Keeping that in mind, the various challenges I had to solve were:
- What approach do I use to compare articles that differ in length?
- If one sentence from the left is highly similar to two or more sentences from the right (or vice-versa), how do I prevent the same sentence from being picked multiple times in sentence pairs?
- At what score between 0 and 1 does a sentence pair stop counting as similar?
- Of the two sentences in a pair, which one do I pick for the final “unbiased” news summary?
- How do I arrange the sentences back into a single coherent read instead of something that reads like a bag of skittles thrown together?
- Finally, what can be done to contain the length of the final article so that it’s a summary and not a very long read?
I won’t be delving into my solutions for these problems, as that could be an entire article by itself. But after creating heuristics to solve all of them, I got consistently sensible outputs across various news events!
Below is an example of the final version for the U.S. Supreme Court news story we have been referencing.
I deployed a user-friendly proof-of-concept app of this program online using Streamlit. Here, you can enter links to a left and a right article of your choosing (from the 5 pre-designated media outlets). The program then scrapes these articles and creates an “unbiased” summary on the fly. The original articles are also printed further below, with highlights on the common sentences that were used to create the summary.
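To give a flavor of the kind of heuristic involved, here is one simple way to tackle the duplicate-match, cut-off and length questions. It is an illustrative sketch and not the logic used in the project: the greedy one-to-one matching, the 0.7 threshold, the choice to keep the left article's sentence, and the names `match_sentences` and `build_summary` are all assumptions.

```python
from typing import List, Sequence, Tuple

Pair = Tuple[int, int, float]  # (left index, right index, similarity score)


def match_sentences(scores: Sequence[Sequence[float]], threshold: float = 0.7) -> List[Pair]:
    """Greedy one-to-one matching over a similarity matrix.

    scores[i][j] is the similarity of left sentence i and right sentence j.
    Pairs below the (assumed) threshold are dropped, and each sentence is
    used in at most one pair, strongest pairs first.
    """
    candidates = [
        (i, j, s)
        for i, row in enumerate(scores)
        for j, s in enumerate(row)
        if s >= threshold
    ]
    candidates.sort(key=lambda pair: pair[2], reverse=True)

    used_left, used_right = set(), set()
    pairs: List[Pair] = []
    for i, j, s in candidates:
        if i in used_left or j in used_right:
            continue  # one of the two sentences already belongs to a stronger pair
        used_left.add(i)
        used_right.add(j)
        pairs.append((i, j, s))
    return pairs


def build_summary(left: Sequence[str], pairs: Sequence[Pair], max_sentences: int = 5) -> List[str]:
    """Keep the strongest pairs, then restore the left article's original order."""
    strongest = sorted(pairs, key=lambda pair: pair[2], reverse=True)[:max_sentences]
    # Sorting the tuples orders them by left index, i.e. position in the article.
    return [left[i] for i, _, _ in sorted(strongest)]


# Toy 3 x 3 matrix: left sentence 0 resembles right sentences 0 and 1,
# but it may only be paired once.
toy_scores = [
    [0.90, 0.80, 0.10],
    [0.85, 0.20, 0.30],
    [0.10, 0.75, 0.20],
]
toy_left = ["Sentence A.", "Sentence B.", "Sentence C."]
toy_pairs = match_sentences(toy_scores)
toy_summary = build_summary(toy_left, toy_pairs)
```

Greedy matching is not guaranteed to find the best overall assignment, but it is easy to reason about, and capping the output with `max_sentences` keeps the result a summary rather than a full article.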
Where to From Here?
This project has plenty of scope to grow into a complete, viable product.
To do that, we would need to expand the number of media outlets that can be tapped into, from the current 5 to possibly 30+.
We should also add a step that first finds what is commonly being said among multiple prominent outlets on the left and, separately, on the right, and then compares these across the aisle to create the “unbiased” summary, along with a “right” and a “left” summary.
For the crème de la crème, auto scrapers can be designed to look at story headlines from news aggregator websites every few hours, collect the entire articles, summarize and pre-store them for ready reference, which can also be sent out in a daily email blast to subscribed readers!