Fake news detection - Text Classification approach
02 Dec 2018 by dzlab
Fake news can belong to one of the following categories 1: news that is intentionally false (i.e. a serious fabrication), hoaxes (i.e. created with the intent to go viral on social media networks), or articles intended as humor or satire. Here is a sample of legitimate and crowdsourced fake news in the Technology domain 2:
| Legitimate | Fake |
| --- | --- |
| Nintendo Switch game console to launch in March for $299 The Nintendo Switch video game console will sell for about $260 in Japan, starting March 3, the same date as its global rollout in the U.S. and Europe. The Japanese company promises the device will be packed with fun features of all its past machines and more. Nintendo is promising a more immersive, interactive experience with the Switch, including online playing and using the remote controller in games that don’t require players to be constantly staring at a display. Nintendo officials demonstrated features such as using the detachable remote controllers, called ”Joy-Con,” to play a gun-duel game. Motion sensors enable players to feel virtual water being poured into a virtual cup. | New Nintendo Switch game console to launch in March for $99 Nintendo plans a promotional roll out of it’s new Nintendo switch game console. For a limited time, the console will roll out for an introductory price of $99. Nintendo promises to pack the new console with fun features not present in past machines. The new console contains new features such as motion detectors and immerse and interactive gaming. The new introductory price will be available for two months to show the public the new advances in gaming. However, initial quantities will be limited to 250,000 units available at the sales price. So rush out and get yours today while the promotional offer is running. |
The task of detecting fake news requires applying NLP algorithms to search for patterns or linguistic constructs that could be used to flag an article as fake news. This task is different from fact checking, which involves cross-referencing articles with other articles to look for inconsistencies in the given information.
An algorithm that detects fake news with an accuracy that outperforms that of humans is cutting-edge AI work, as it involves not only telling fake from non-fake news but also verifying the ground truth and accounting for factors such as developing news and linguistic and cultural interpretation.
In the following, we address fake news detection with a text classification approach that simply uses an NLP algorithm to parse sentence structure and home in on keywords, classifying news based on a training set of articles flagged as fake or non-fake.
The problem with fake news detection is that there is not enough data, i.e. no large collection of articles meeting the specific requirements that constitute a fake news corpus. What researchers usually do is construct a dataset by crowdsourcing fake news articles (e.g. through Amazon Mechanical Turk workers).
Fake news Datasets:
The following are some commonly available datasets for training NLP algorithms to detect fake news:
- BuzzFeedNews Facebook fact check dataset - link
- Kaggle dataset
- Kaggle competition using news headlines in Chinese and English (translated) - link
We will be using this dataset. After downloading and unzipping the file, load it into a dataframe to have a look:
The first thing we need to do is transform those articles into something that can be processed by computers, through two different steps:
- Tokenization: split the original sentences into tokens (i.e. words). For example, split on spaces, properly handle punctuation, clean the text (e.g. remove HTML tags), and separate contractions (e.g. isn’t, don’t) into different words.
- Numericalization: convert the tokens into integers by creating a vocabulary (i.e. list of all the words in the corpus). The size of the vocabulary should be limited (e.g. to 60,000) and it should contain only useful words (e.g. tokens that appear at least twice). Infrequent words can be replaced by the unknown token UNK.
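The two steps above can be sketched in plain Python. This is only an illustration of the idea, not the command used in this post: the helper names `tokenize`, `build_vocab` and `numericalize` are hypothetical, and the tokenization rules are deliberately minimal.

```python
import re
from collections import Counter

UNK = "UNK"  # placeholder for words that are not in the vocabulary


def tokenize(text):
    """Remove HTML tags, lowercase, split contractions and punctuation."""
    text = re.sub(r"<[^>]+>", " ", text)   # clean the text: drop HTML tags
    text = text.lower()
    text = re.sub(r"n't\b", " n't", text)  # isn't -> is n't
    # A token is "n't", a run of letters/digits, or one punctuation mark.
    return re.findall(r"n't|[a-z0-9]+|[^\sa-z0-9]", text)


def build_vocab(token_lists, max_size=60000, min_freq=2):
    """Keep the most frequent tokens that appear at least min_freq times."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    kept = [tok for tok, c in counts.most_common(max_size - 1) if c >= min_freq]
    itos = [UNK] + kept                        # index -> token
    stoi = {tok: i for i, tok in enumerate(itos)}  # token -> index
    return itos, stoi


def numericalize(tokens, stoi):
    """Map each token to its integer id, falling back to UNK."""
    return [stoi.get(tok, stoi[UNK]) for tok in tokens]


docs = ["<p>Nintendo isn't cheap. Nintendo rocks!</p>"]
token_lists = [tokenize(d) for d in docs]
itos, stoi = build_vocab(token_lists, min_freq=2)
ids = numericalize(token_lists[0], stoi)
print(token_lists[0])
print(itos)
print(ids)
```

In this toy corpus only "nintendo" appears twice, so every other token is mapped to the UNK id.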
These operations are performed by the following simple command (this can be slow):
A language model is a model trained to guess the next word starting from a sequence of words as input. It has a recurrent structure and a hidden state that is updated each time it sees a new word. This hidden state thus contains information about the sentence up to that point. Check this article for more details on how to train a language model from scratch.
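To make the role of the hidden state concrete, here is a minimal sketch of a recurrent next-word predictor in plain Python. It is an untrained, Elman-style cell with random weights, so its predictions are meaningless; the class name `TinyRNN` and its sizes are assumptions for illustration and not the architecture trained in this post.

```python
import math
import random


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class TinyRNN:
    """Untrained recurrent cell: h_t = tanh(E[word_t] + h_{t-1} W_hh)."""

    def __init__(self, vocab_size, hidden_size, seed=0):
        rng = random.Random(seed)

        def mat(rows, cols):
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

        self.hidden_size = hidden_size
        self.vocab_size = vocab_size
        self.W_xh = mat(vocab_size, hidden_size)   # one input vector per word
        self.W_hh = mat(hidden_size, hidden_size)  # hidden-to-hidden weights
        self.W_hy = mat(hidden_size, vocab_size)   # hidden-to-output weights

    def step(self, word_id, h):
        """Update the hidden state after seeing one more word."""
        new_h = []
        for j in range(self.hidden_size):
            total = self.W_xh[word_id][j]
            total += sum(h[i] * self.W_hh[i][j] for i in range(self.hidden_size))
            new_h.append(math.tanh(total))
        return new_h

    def next_word_probs(self, word_ids):
        """Read a sequence word by word, then score every candidate next word."""
        h = [0.0] * self.hidden_size
        for w in word_ids:
            h = self.step(w, h)
        scores = [
            sum(h[i] * self.W_hy[i][k] for i in range(self.hidden_size))
            for k in range(self.vocab_size)
        ]
        return softmax(scores), h


model = TinyRNN(vocab_size=5, hidden_size=3)
probs, hidden = model.next_word_probs([1, 4, 2])
print(probs)
print(hidden)
```

Training would adjust the three weight matrices so that the probability assigned to the word that actually comes next increases.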
Rather than training a news classifier from scratch, we start from a model pretrained on a bigger dataset (wikitext-103 3). This pretrained model captures a ‘knowledge’ of the English language which will be useful to our classifier.
But we should properly handle the specificity of our dataset. The English of the news in our dataset isn’t the same as the English of Wikipedia, so we need to adjust the parameters of our model. Furthermore, words that are very common in our dataset may not be present in Wikipedia, and as a result might not be in the vocabulary of the wikitext-103 model.
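The vocabulary mismatch can be illustrated with a small sketch. The function `adapt_embeddings` below is hypothetical: it copies the pretrained vector for every word both vocabularies share and, as an assumed strategy that this post does not specify, gives each unseen word the mean of the pretrained vectors as a starting point.

```python
def adapt_embeddings(pretrained_itos, pretrained_vectors, new_itos):
    """Build an embedding table for new_itos from a pretrained one.

    Returns the new table and the list of words missing from the
    pretrained vocabulary (initialised with the mean vector, an assumption).
    """
    dim = len(pretrained_vectors[0])
    n = len(pretrained_vectors)
    mean = [sum(vec[d] for vec in pretrained_vectors) / n for d in range(dim)]
    stoi = {tok: i for i, tok in enumerate(pretrained_itos)}

    new_vectors, missing = [], []
    for tok in new_itos:
        if tok in stoi:
            new_vectors.append(list(pretrained_vectors[stoi[tok]]))
        else:
            new_vectors.append(list(mean))
            missing.append(tok)
    return new_vectors, missing


# Toy 2-dimensional embeddings for a toy pretrained vocabulary.
pretrained_itos = ["the", "game", "console"]
pretrained_vectors = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
new_itos = ["the", "nintendo", "console"]

new_vectors, missing = adapt_embeddings(pretrained_itos, pretrained_vectors, new_itos)
print(missing)
print(new_vectors)
```

The vectors of the missing words are only placeholders; fine-tuning on the news dataset is what gives them useful values.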
Therefore, before jumping into classification, we first need to fine-tune the pretrained model on our particular dataset. We will use a special TextDataBunch class for the language model that ignores the labels (fake vs. real), then train this model for several epochs as follows:
The following figures depict the training history (training takes several hours to finish):
Once the model is trained on our dataset, we can try to generate news as follows:
The output should look like this:
Total time: 00:02 Total time: 00:02 health care group , and habit of course , it a clear blank documentation on the xxmaj congress defining on fire troop transfer rust care growing discovers that ’s contained , in xxmaj new xxmaj january , the economy and what health care , the xxmaj established to xxmaj israel wanted to the xxup u.s. xxup u.s. xxmaj stein issues like xxmaj intermediate - indecent or both xxmaj presidential proportionally , showing up recently released by xxmaj the xxmaj february
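The xxmaj and xxup markers in this output look like the capitalization tokens that fastai's default text preprocessing emits (xxmaj before a capitalized word, xxup before an all-caps word). Assuming that is what they are, a small hypothetical helper such as `detokenize` below makes generated text easier to read:

```python
def detokenize(text):
    """Apply xxmaj / xxup markers to the token that follows them."""
    out = []
    pending = None  # marker waiting to be applied to the next token
    for tok in text.split():
        if tok in ("xxmaj", "xxup"):
            pending = tok
            continue
        if pending == "xxmaj":
            tok = tok.capitalize()  # capitalize the first letter
        elif pending == "xxup":
            tok = tok.upper()       # restore an all-caps word
        pending = None
        out.append(tok)
    return " ".join(out)


sample = "in xxmaj new xxmaj january , the xxup u.s."
print(detokenize(sample))
```

Spacing around punctuation is left untouched, since the original spacing cannot be recovered from the tokens alone.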
After training a language model on our fake news dataset, we can use this model to extract features from the articles and use them as classification attributes.
After several epochs, the accuracy of the classifier reaches
Here is an example of classifying an article:
The output should look like this:
(Category FAKE, tensor(0), tensor([9.9990e-01, 1.0478e-04]))
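The output above has three parts: the predicted category, its index, and one probability per class. The sketch below shows how the first two follow from the third, using plain floats instead of tensors. The helper `interpret` is hypothetical, and the name of the second class ("REAL") is an assumption, since the output only shows FAKE at index 0.

```python
def interpret(probs, classes):
    """Return (label, index, probability) of the most likely class."""
    idx = max(range(len(probs)), key=lambda i: probs[i])
    return classes[idx], idx, probs[idx]


classes = ["FAKE", "REAL"]        # "REAL" is an assumed label name
probs = [9.9990e-01, 1.0478e-04]  # values copied from the output above

label, idx, confidence = interpret(probs, classes)
print(label, idx, confidence)
```

Here the classifier assigns almost all of the probability to index 0, which is why the article is labelled FAKE.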
The full notebook can be found here - link