Fake News
Fake news is a form of news consisting of deliberate disinformation or hoaxes spread via traditional news media or online social media. Digital news has brought back and increased the usage of fake news, or yellow journalism.
Yellow journalism and the yellow press are American terms for journalism and associated newspapers that present little or no legitimate well-researched news while instead using eye-catching headlines for increased sales.
In this article, we will use the TfidfVectorizer and PassiveAggressive classifier to classify fake news and genuine news.
TfidfVectorizer
Convert a collection of raw documents to a matrix of TF-IDF features.
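TF (Term Frequency): The number of times a word appears in a document is its term frequency in that document.
IDF (Inverse Document Frequency): The inverse document frequency measures how rare a word is across all documents in the corpus. Words that appear in only a few documents receive a higher weight than words that appear in most of them.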
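As a rough sketch of the two definitions above, the following pure-Python snippet computes TF and IDF by hand for a tiny corpus. The IDF formula used here, ln((1 + n) / (1 + df)) + 1, is the smoothed variant that scikit-learn's TfidfVectorizer applies by default. The helper names term_frequency and inverse_document_frequency are illustrative only, and the row normalisation that TfidfVectorizer applies afterwards is left out.

```python
import math

# Tiny illustrative corpus, already lower-cased and tokenized.
docs = [
    ["this", "is", "the", "first", "document"],
    ["this", "document", "is", "the", "second", "document"],
    ["and", "this", "is", "the", "third", "one"],
]

def term_frequency(word, doc):
    """Number of times `word` occurs in one document."""
    return doc.count(word)

def inverse_document_frequency(word, docs):
    """Smoothed IDF: ln((1 + n) / (1 + df)) + 1, where df is the
    number of documents that contain `word`."""
    n = len(docs)
    df = sum(1 for doc in docs if word in doc)
    return math.log((1 + n) / (1 + df)) + 1

# Compare a word found everywhere, a fairly common one, and a rare one,
# all scored against the second document.
for word in ["the", "document", "second"]:
    tf = term_frequency(word, docs[1])
    idf = inverse_document_frequency(word, docs)
    print(f"{word!r}: tf={tf}, idf={idf:.3f}, tf*idf={tf * idf:.3f}")
```

A word such as "the", which occurs in every document, gets the smallest possible IDF, while "second", which occurs in only one document, gets the largest.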
A Simple Example of How TfidfVectorizer Works
#Example of TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
'This is the first document.',
'This document is the second document.',
'And this is the third one.',
'Is this the first document?',
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
print(X)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(vectorizer.get_feature_names_out())
Output
Note
From the output above, we can see that TfidfVectorizer first counts how often each word occurs in the corpus and then assigns a weight to every word in the matrix. After the TF-IDF vectorization we have 9 features in total, i.e. a matrix with 9 columns (one per distinct word) and 4 rows (one per document, holding the weight of each word).
For example, the word "the" appears in every document of the corpus, so TfidfVectorizer gives it a low IDF weight: a word that occurs everywhere contributes little to telling documents apart in classification.
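The point about "the" can be checked directly on the fitted vectorizer. The sketch below reuses the same four-sentence corpus and reads the learned weights from the vocabulary_ and idf_ attributes of TfidfVectorizer. The choice of the three words to inspect is only for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

# Rows are documents, columns are distinct words.
print(X.shape)

# vocabulary_ maps each word to its column index;
# idf_ holds the learned IDF weight for each column.
for word in ["the", "document", "second"]:
    col = vectorizer.vocabulary_[word]
    print(word, round(float(vectorizer.idf_[col]), 3))
```

The word that appears in all four documents ends up with the lowest IDF value, and the word that appears in only one document ends up with the highest.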
PassiveAggressive classifier
The passive-aggressive algorithms are a family of algorithms for large-scale learning.
Official Documentation of PassiveAggressive Algorithms.
Here it is enough to know that this algorithm remains passive when a prediction is correct and turns aggressive when a prediction is wrong, updating and adjusting its weights to correct the mistake.
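To make the passive and aggressive behaviour concrete, here is a minimal sketch of the basic passive-aggressive update for a linear binary classifier with labels -1 and +1. The function name pa_update and the toy vectors are assumptions made for illustration. This is not scikit-learn's implementation, which additionally uses a regularisation parameter C to limit the step size; the sketch also assumes the input vector x is not all zeros.

```python
import numpy as np

def pa_update(w, x, y):
    """One step of the basic passive-aggressive rule for y in {-1, +1}."""
    # Hinge loss: zero when the example is classified with margin >= 1.
    loss = max(0.0, 1.0 - y * np.dot(w, x))
    if loss == 0.0:
        return w  # passive: nothing to correct
    # Aggressive: smallest step that brings the margin back to 1.
    tau = loss / np.dot(x, x)
    return w + tau * y * x

w = np.zeros(2)
x = np.array([1.0, 2.0])

w = pa_update(w, x, +1)          # margin violated, so the weights move
print(w, np.dot(w, x))

w_again = pa_update(w, x, +1)    # margin now satisfied, so nothing changes
print(np.allclose(w, w_again))
```

After the first call the example sits exactly on the margin, so the second call on the same example leaves the weights as they are.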
Project (Implementation)
Importing required libraries
#importing libraries
import numpy as np
import pandas as pd
import itertools
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
Note
NumPy and pandas are used here for data manipulation. The itertools module, which provides tools for handling iterators, is also imported.
Itertools: Python's itertools module is a collection of tools for handling iterators. Simply put, an iterator produces its values one at a time, for example in a for loop. The most common iterable in Python is the list, and an iterator over it can be obtained with iter().
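A short sketch of what this means in practice: a list can be turned into an iterator with iter(), and itertools builds new iterators on top of existing iterables. The variable names and the sample values below are made up for illustration.

```python
import itertools

words = ["fake", "real", "fake"]   # a list is an iterable
it = iter(words)                   # iter() returns an iterator over it
print(next(it))                    # an iterator yields one value at a time
print(next(it))

# itertools builds new iterators from existing iterables.
first_two = list(itertools.islice(words, 2))
pairs = list(itertools.product(["FAKE", "REAL"], repeat=2))
print(first_two)
print(pairs)
```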
In the next step we will read the dataset that we are going to use; it contains both fake and real news. You can download the dataset from here (click here).
#Reading the datasets
df=pd.read_csv('news.csv')
df.head()
Output
Note: As you can see in the output above, the dataset contains the title of each news item, its text, and its label (i.e. fake 0 or real 1).
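Before building features it can help to look at the size of the data, the balance between the two labels, and any missing text. The sketch below uses a small, hypothetical in-memory DataFrame as a stand-in for news.csv, assuming the same three columns (title, text, label); the rows are invented for illustration.

```python
import pandas as pd

# Hypothetical stand-in for news.csv with the same three columns.
df = pd.DataFrame({
    "title": ["Headline A", "Headline B", "Headline C", "Headline D"],
    "text": ["Body of article A", "Body of article B", None, "Body of article D"],
    "label": [0, 1, 1, 0],
})

print(df.shape)                      # (rows, columns)
print(df["label"].value_counts())    # number of fake (0) and real (1) items
print(df["text"].isna().sum())       # number of missing article bodies

# Drop rows without text so the vectorizer only receives strings.
df = df.dropna(subset=["text"])
print(df.shape)
```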
#Defining our features and target
X=df['text']
Y=df['label']
#using split function
x_train,x_test,y_train,y_test=train_test_split(X,Y, test_size=0.2, random_state=7)
Note
Our goal is to determine whether a text is fake or real. That means our target is the label (fake or real), which we predict from the feature, i.e. the text. We then split the data into training and test sets: with test_size=0.2, 80% of the data is used for training and 20% for testing. The training data is used to fit the model, whereas the test data shows how well the model has learned.
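The effect of test_size and random_state can be seen on a toy example. The ten made-up articles below are placeholders, and the stratify argument is an optional addition that is not used in this article's code; it keeps the ratio of fake to real items the same in both parts.

```python
from sklearn.model_selection import train_test_split

# Ten placeholder articles with alternating labels (0 = fake, 1 = real).
texts = [f"article {i}" for i in range(10)]
labels = [0, 1] * 5

x_train, x_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=7, stratify=labels
)

print(len(x_train), len(x_test))   # sizes of the two parts
print(sorted(y_test))              # labels that ended up in the test set

# The same random_state reproduces exactly the same split.
x_train2, x_test2, _, _ = train_test_split(
    texts, labels, test_size=0.2, random_state=7, stratify=labels
)
print(x_test == x_test2)
```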
#preprocessing of data (tokenize and creating matrix)
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer=TfidfVectorizer(stop_words='english', max_df=0.7)
#preprocessing of train data
tfidf_train=tfidf_vectorizer.fit_transform(x_train)
#preprocessing of test data
tfidf_test=tfidf_vectorizer.transform(x_test)
Note
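As discussed earlier, TfidfVectorizer tokenizes the text, converts it into a matrix, and assigns a weight to each word; this is the preprocessing step. It is necessary because a model cannot work on raw text documents directly and needs numeric input, as in the TfidfVectorizer example above. Here stop_words='english' removes common English words, and max_df=0.7 ignores words that appear in more than 70% of the documents. Note that the vectorizer is fitted on the training data only; the test data is then transformed with the vocabulary learned from the training data.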
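The difference between fit_transform on the training data and transform on the test data can be illustrated with a few invented sentences. The sketch assumes default TfidfVectorizer settings (no stop-word removal) to keep the vocabulary easy to follow.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented sentences standing in for training and test articles.
train_docs = ["the senate passed the budget bill", "aliens endorse the budget"]
test_docs = ["the senate met aliens on mars"]

vec = TfidfVectorizer()
tfidf_train = vec.fit_transform(train_docs)   # learns the vocabulary and IDF weights
tfidf_test = vec.transform(test_docs)         # reuses them, learns nothing new

print(sorted(vec.vocabulary_))
print(tfidf_train.shape, tfidf_test.shape)    # same number of columns
print("mars" in vec.vocabulary_)              # words unseen in training are ignored
```

Because both matrices share the same columns, a classifier trained on tfidf_train can be applied to tfidf_test directly.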
#classifier or algorithm to learn the model
passive=PassiveAggressiveClassifier(max_iter=50)
passive.fit(tfidf_train,y_train)
y_pred=passive.predict(tfidf_test)
#accuracy of the model
score=accuracy_score(y_test,y_pred)
print(score)
#confusion matrix or kind of error calculation
confusion_matrix(y_test,y_pred)
Output
Note
Here we used the passive-aggressive classifier, a supervised learning algorithm, to train our model. After training, we tested the model on the test data. The accuracy of our model is 0.92 (i.e. 92%). From the confusion matrix we can see that the model correctly classified 586 fake news items (0) and 586 real news items (1), and made 52 + 43 = 95 wrong predictions.
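The accuracy can be recomputed from the confusion matrix counts quoted above: correct predictions lie on the diagonal, and accuracy is their share of all predictions. In the sketch below, which of the two off-diagonal cells holds 52 and which holds 43 is an assumption; it does not change the accuracy.

```python
import numpy as np

# Counts quoted in the note above. The placement of 52 and 43 in the
# off-diagonal cells is assumed and has no effect on the accuracy.
cm = np.array([[586, 52],
               [43, 586]])

correct = np.trace(cm)      # correctly classified items (diagonal)
total = cm.sum()            # all items in the test set
accuracy = correct / total

print(correct, total, round(float(accuracy), 4))
```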