Data Science Fails: Fake News, Fake Data | DataRobot

I read most of my news online, and I’m not alone. According to a Pew Research Center survey, a third of people prefer to get their news online. For some demographics the percentage is even higher. For example, 76% of people aged 18 to 49 who prefer to read the news also prefer to read their news online, versus only 8% via printed newspapers. The percentage of people reading news online is probably even higher due to social media, with 68% of U.S. adults using Facebook, most of them daily.

Here’s a snippet from a news story I read online today:

“There was a corrupt politician, a corrupt businessman, a corrupt doctor, and a corrupt lawyer,” Trump told a crowd in North Carolina. “These are all bad people. This is a bad, bad place.”

At first glance, this news story doesn’t seem remarkable, but an AI wrote the text, and it is fake news. I used an online AI powered by the GPT-2 algorithm to automatically write the news article. All I had to do was write the first few words, “There was a corrupt politician,” and the algorithm wrote the remainder of the story. A Google search shows that Trump never said these quotes. Yet the AI wrote a news article that reads coherently and convincingly, as if a human wrote it. After the algorithm chose to include a quote, it also chose to imitate Trump’s speaking style.

The news story writing algorithm represents a significant step forward in AI capabilities with natural language. Research lab OpenAI announced the GPT-2 algorithm in February 2019. Give it a fake headline or an opening phrase, and the algorithm writes the rest of the article, complete with fake quotations and statistics. OpenAI was so worried about the potential for misuse of its system that it initially released neither the training dataset nor the full model. But several months have passed, and OpenAI now says it has seen “no strong evidence of misuse” and has released the model in full, despite research that “demonstrated that it’s possible to create models that can generate synthetic propaganda” for extremist ideologies.
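GPT-2 itself is a large neural network trained on millions of web pages, but its core autoregressive idea, predicting the next word from the words so far and feeding the prediction back in, can be sketched with a toy word-bigram model. The miniature corpus and all names below are invented for illustration and are not part of GPT-2:

```python
import random

def train_bigram_model(corpus: str) -> dict:
    """Count word bigrams: for each word, how often each next word follows it."""
    words = corpus.split()
    model: dict = {}
    for prev, nxt in zip(words, words[1:]):
        model.setdefault(prev, {})
        model[prev][nxt] = model[prev].get(nxt, 0) + 1
    return model

def generate(model: dict, prompt: str, length: int = 10, seed: int = 0) -> str:
    """Autoregressively extend the prompt: sample each next word from the
    distribution conditioned on the previous word, then repeat."""
    rng = random.Random(seed)
    words = prompt.split()
    for _ in range(length):
        options = model.get(words[-1])
        if not options:  # dead end: no continuation seen in training
            break
        choices, weights = zip(*options.items())
        words.append(rng.choices(choices, weights=weights)[0])
    return " ".join(words)

corpus = ("there was a corrupt politician and a corrupt businessman "
          "and the crowd said this is a bad bad place")
model = train_bigram_model(corpus)
print(generate(model, "there was a corrupt"))
```

GPT-2 conditions on the entire prefix with a transformer rather than just the previous word, which is what makes its output coherent over whole paragraphs, but the generation loop is the same shape.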

To summarize, AI has become quite convincing at writing fake news. But can AI also protect us from fake news?

Case Study: Detecting Fake News

In January 2018, Twitter admitted that more than 50,000 Russia-linked accounts used its service to post automated material about the 2016 U.S. election. The posts had reached at least 677,775 Americans. In response, Twitter removed 50,258 accounts and passed their details to investigators.

Similarly, an investigation into fake news published on Facebook found that fake news on political topics reached 158 million people. This number only included stories that had already been fact-checked and debunked by reputable U.S. fact-checking organizations. Fake news stories spanned the political spectrum. One fake news story falsely claimed that Trump’s grandfather was a pimp and tax evader, and his father was a member of the KKK. Another fake news story alleged that Nancy Pelosi diverted billions of dollars to cover the costs of the impeachment process. In total, these two stories had more than 50 million views. The report concluded that “Facebook’s measures have largely failed to reduce the spread of viral disinformation on the platform.” Nick Clegg, the communications chief for Facebook, has said that they “do not verify the claims of politicians for factual accuracy.”

Social media companies have hired thousands of employees to prevent the spread of fake news on their platforms. Yet with so much disinformation occurring at such a scale, it is impossible for humans to manually detect and correct all fake news – and now that disinformation can be automated by using AI, the task of manual detection is looking even more hopeless. Therefore, researchers have begun training AIs to automatically detect fake news. While the research field is young, there has been progress with promising results.

One approach detects fake news using stylometry-based provenance, i.e., tracing a text’s writing style back to its producing source and determining whether that source is malicious. The stylometry-based approach assumes that fake news can be identified solely by determining the source that generated the text. For example, it assumes that news articles from a mainstream newspaper are more accurate than posts from a propaganda website. The identified source may also reveal whether the article was written by a human or generated by a machine. Researchers report achieving up to 71% accuracy with a stylometry-based approach, depending on the dataset used.
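As a rough sketch of what stylometry means in practice, the snippet below fingerprints a text by its function-word frequencies and attributes it to the source whose average style is nearest. Real systems use far richer features and learned classifiers; the sources, samples, and word list here are invented for illustration:

```python
from collections import Counter
import math

# Common function words: a classic stylometric fingerprint, because
# authors use them unconsciously and consistently.
FUNCTION_WORDS = ["the", "a", "of", "and", "to", "in", "that", "is", "was", "it"]

def style_vector(text: str) -> list:
    """Relative frequency of each function word in the text."""
    words = text.lower().split()
    counts = Counter(words)
    total = max(len(words), 1)
    return [counts[w] / total for w in FUNCTION_WORDS]

def centroid(vectors: list) -> list:
    """Average style vector for one source's known texts."""
    return [sum(col) / len(col) for col in zip(*vectors)]

def classify(text: str, centroids: dict) -> str:
    """Attribute a text to the source whose centroid style is nearest."""
    vec = style_vector(text)
    return min(centroids, key=lambda src: math.dist(vec, centroids[src]))

samples = {
    "newsroom": ["the senator said that the bill was passed in the house",
                 "the report noted that the economy grew in the quarter"],
    "propaganda": ["they lie they cheat wake up people believe nothing",
                   "shocking truth exposed share before they delete this"],
}
centroids = {src: centroid([style_vector(t) for t in texts])
             for src, texts in samples.items()}
print(classify("the minister said that the vote was held in the senate", centroids))
```

Note that nothing in this pipeline ever checks whether a claim is true; it only asks “who writes like this?”, which is exactly the weakness the MIT researchers later exposed.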

Another approach trains against the FEVER (Fact Extraction and VERification) dataset, which consists of 185,445 “claims” generated by altering sentences extracted from Wikipedia and subsequently verified without knowledge of the sentences from which they were derived. Rigorous fact verification, as used in manual processes for identifying fake news, requires validating a claim against reliably sourced evidence. However, on the FEVER dataset, claim-only classifiers (which don’t validate against the evidence) perform competitively versus the top evidence-aware models. Researchers report achieving predictive accuracy of up to 61% on the FEVER dataset.
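To see how a claim-only classifier can score well without ever reading evidence, consider this deliberately crude sketch that fires on surface cues alone. The cue list is invented; real claim-only models are learned neural networks, but they can end up behaving much like this:

```python
# Surface cues only; no evidence retrieval happens anywhere in this "model".
NEGATION_CUES = {"not", "never", "no", "didn't", "wasn't", "nor"}

def claim_only_label(claim: str) -> str:
    """Label a claim without looking at any evidence: refute it if it
    contains a negation cue, otherwise accept it."""
    words = set(claim.lower().split())
    return "REFUTED" if words & NEGATION_CUES else "SUPPORTED"

print(claim_only_label("Obama was not born in Kenya"))     # REFUTED
print(claim_only_label("Paris is the capital of France"))  # SUPPORTED
```

If the dataset happens to pair negated phrasing with the “refuted” label more often than real-world fake news does, a classifier like this scores well for the wrong reasons, which is the problem examined below.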

Accuracy percentages of 71% and 61% may not be perfect, but they do signal progress in the war against fake news. However, recently published research by an MIT team shows that we shouldn’t take these accuracy rates at face value. The problem is that fake news detection developers had chosen an easy-to-beat benchmark.

In the paper “Are We Safe Yet? The Limitations of Distributional Features For Fake News Detection,” researchers identify a problem with provenance-based approaches against attackers that generate fake news: fake and legitimate texts can originate from nearly identical sources. The legitimate text might be auto-generated in a similar process to that of the fake text. Also, attackers can automatically corrupt articles originating from legitimate human sources while keeping the writing style of the original. The authors demonstrate that stylometry-based methods fail when truthful and fake information are both machine-generated, and when an attacker simply inverts the negation of statements in a human-written news story. Existing AIs using this technique are predicting whether a news story was written by humans or computers, not whether it is fake news. The authors concluded that it is essential to assess the veracity of the text rather than relying solely on its writing style or source.

Similarly, in the paper “Towards Debiasing Fact Verification Models,” a related group of researchers identified a problem in the FEVER database that caused the unexpected result that claim-only classifiers performed competitively against evidence-aware classifiers. The authors identify strong cues for predicting labels solely based on the claims made. The FEVER dataset was constructed using crowdsourcing, allowing human bias to introduce human artifacts into the data. For example, the presence of negatively phrased statements was highly predictive of text flagged as fake news, yet this was not representative of fake news in the real world. Modern machine learning algorithms quickly identified these artifacts of human bias. The algorithms weren’t learning to identify fake news; they were merely identifying when crowdsourcing had reworded the original text. The authors proposed a method to de-bias the data so that human artifacts wouldn’t be predictive of the outcome.
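One simple de-biasing recipe in this spirit, sketched below with invented names and toy data, is to fit a bias-only model on the claims alone and then down-weight the training examples it already gets right, so that a downstream evidence-aware model gains little from learning the artifact:

```python
NEGATION_CUES = {"not", "never", "no", "didn't", "wasn't"}

def bias_score(claim: str) -> float:
    """A bias-only 'model': estimated probability the claim is REFUTED,
    based purely on surface cues in the claim text."""
    return 0.9 if set(claim.lower().split()) & NEGATION_CUES else 0.1

def reweight(examples: list) -> list:
    """Down-weight examples the bias-only model already gets right, so a
    downstream evidence-aware model cannot profit from the cue alone."""
    weighted = []
    for claim, label in examples:
        p_refuted = bias_score(claim)
        p_correct = p_refuted if label == "REFUTED" else 1 - p_refuted
        weighted.append((claim, label, 1 - p_correct))  # easy-for-bias => low weight
    return weighted

examples = [
    ("the senator did not attend the vote", "REFUTED"),   # cue agrees with label
    ("the senator never missed a vote", "SUPPORTED"),     # cue misleads
]
for claim, label, weight in reweight(examples):
    print(f"{weight:.1f}  {label:9s}  {claim}")
```

Examples where the cue already predicts the label get small weights, while examples that contradict the cue get large ones, pushing the main model toward evidence instead of phrasing.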


It seems that AI is better at creating fake news than identifying it. But the lesson learned from the case study on detecting fake news can be applied to any AI project. It is essential that you apply critical thinking skills to training and evaluating your AIs:

  • Beware of using fake data to train your AI. Fake data, whether simulated or crowdsourced from humans, is usually not the same as real data. Machine learning algorithms will try to cheat, learning to identify the artifacts of how the data was created instead of the characteristics of real life.

  • Beware of using proxies for the outcome that you wish to predict or decide. Those proxies may not align with your intended outcome.

  • Insist that your AIs provide human-friendly explanations for how they work and why they made their decisions. Then check whether those explanations show the AI finding true-to-life patterns, or just cheating.
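One lightweight way to check whether a model is cheating is permutation importance: shuffle one input feature across examples and see how much accuracy drops. The sketch below, with invented features and a deliberately cheating model, shows the idea:

```python
import random

def accuracy(model, X: list, y: list) -> float:
    """Fraction of examples the model labels correctly."""
    return sum(model(x) == t for x, t in zip(X, y)) / len(y)

def permutation_importance(model, X: list, y: list, feature: str, seed: int = 0) -> float:
    """Shuffle one feature across examples and measure the accuracy drop.
    A large drop means the model leans heavily on that feature."""
    rng = random.Random(seed)
    values = [x[feature] for x in X]
    rng.shuffle(values)
    X_perm = [dict(x, **{feature: v}) for x, v in zip(X, values)]
    return accuracy(model, X, y) - accuracy(model, X_perm, y)

def cheater(x: dict) -> str:
    # A "fake news detector" that secretly keys on a crowdsourcing artifact
    # and ignores the substantive feature entirely.
    return "fake" if x["has_negation"] else "real"

X = [{"has_negation": True, "source_trust": 0.2},
     {"has_negation": False, "source_trust": 0.9},
     {"has_negation": True, "source_trust": 0.1},
     {"has_negation": False, "source_trust": 0.8}]
y = ["fake", "real", "fake", "real"]

print(permutation_importance(cheater, X, y, "has_negation"))
print(permutation_importance(cheater, X, y, "source_trust"))  # 0.0: the model ignores it
```

If shuffling a substantive feature changes nothing, the model cannot be using it, so any accuracy the model has must come from elsewhere, here the artifact feature.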

