in AI-ML-Data Science Projects by Goeduhub's Expert (3.1k points)
Fake News Detection using machine learning (Python)


Fake News 

Fake news is a form of news consisting of deliberate disinformation or hoaxes spread via traditional news media or online social media. Digital news has brought back and increased the usage of fake news, or yellow journalism.

Yellow journalism and the yellow press are American terms for journalism and associated newspapers that present little or no legitimate well-researched news while instead using eye-catching headlines for increased sales.

In this article, we will use the TfidfVectorizer and PassiveAggressive classifier to classify fake news and genuine news. 

TfidfVectorizer

TfidfVectorizer converts a collection of raw documents to a matrix of TF-IDF features.

TF (Term Frequency): The number of times a word appears in a document is its term frequency.

IDF (Inverse Document Frequency): Inverse document frequency measures how rare a word is across all documents in the corpus. Words that appear in only a few documents get a higher weight, and words that appear in most documents get a lower weight.
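To make the two definitions concrete, here is a minimal from-scratch sketch in plain Python. It assumes the smoothed IDF formula ln((1 + n) / (1 + df)) + 1, which scikit-learn documents as its default, and it uses a simple whitespace tokenizer. TfidfVectorizer also normalizes each row, which is skipped here, so the numbers will not match the library output exactly.

```python
import math
from collections import Counter

corpus = [
    "this is the first document",
    "this document is the second document",
    "and this is the third one",
    "is this the first document",
]

# Tokenize by whitespace (a simplification; real tokenizers also handle punctuation).
docs = [text.split() for text in corpus]
n_docs = len(docs)

# Document frequency: in how many documents does each word appear?
df = Counter(word for doc in docs for word in set(doc))

# Smoothed inverse document frequency (assumed formula, see the text above).
idf = {word: math.log((1 + n_docs) / (1 + count)) + 1 for word, count in df.items()}


def tfidf(doc):
    """Return unnormalized TF-IDF weights for one tokenized document."""
    tf = Counter(doc)  # raw term counts in this document
    return {word: count * idf[word] for word, count in tf.items()}


weights = tfidf(docs[1])
for word in sorted(weights):
    print(f"{word:10s} {weights[word]:.3f}")
```

In this sketch a word that occurs in every document, such as "the", gets the smallest possible IDF, while a word that occurs in only one document, such as "second", gets the largest.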

A simple example of how TfidfVectorizer works

#Example of TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
print(X)

#get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(vectorizer.get_feature_names_out())

Output

[Image: output of the TfidfVectorizer example]

Note 

From the output above, we can see that TfidfVectorizer first counts the frequency of every word in the corpus and then assigns a weight to every word in the matrix. After the TF-IDF vectorization we have 9 features in total, i.e. 9 columns (one per distinct word) and 4 rows (one per document, holding the weight of every word).

For example, the word "the" appears in every document of the corpus, so TF-IDF gives it a low weight: it contributes little to telling the documents apart during classification.

PassiveAggressive classifier

The passive-aggressive algorithms are a family of algorithms for large-scale learning.

See the official scikit-learn documentation of the passive-aggressive algorithms for details.

Here it is enough to know that this algorithm remains passive when a classification is correct, and turns aggressive when a misclassification occurs, updating and adjusting its weights.
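The passive/aggressive behaviour can be sketched in a few lines of plain Python. This is only an illustration of one common formulation of the update (often called PA-I) for labels -1 and +1; the function names and the aggressiveness parameter C are chosen here for the example, and the scikit-learn implementation may differ in its details.

```python
def dot(w, x):
    """Dot product of two equal-length lists."""
    return sum(wi * xi for wi, xi in zip(w, x))


def pa_update(w, x, y, C=1.0):
    """One passive-aggressive step for a sample x with label y in {-1, +1}."""
    loss = max(0.0, 1.0 - y * dot(w, x))  # hinge loss on this sample
    norm_sq = dot(x, x)
    if loss == 0.0 or norm_sq == 0.0:
        return w  # passive: the sample is already classified with enough margin
    tau = min(C, loss / norm_sq)  # aggressive: step size grows with the loss, capped at C
    return [wi + tau * y * xi for wi, xi in zip(w, x)]


w = [0.0, 0.0]

# The untrained weights get this positive sample wrong, so they are updated.
w_after = pa_update(w, [1.0, 2.0], +1)
print(w_after)

# A sample that is now classified with a large margin leaves the weights unchanged.
w_same = pa_update(w_after, [2.0, 4.0], +1)
print(w_same == w_after)
```

The design point is that the weights only move when a sample is misclassified or falls inside the margin, which keeps each update cheap and makes the method suitable for large or streaming datasets.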

Project (Implementation)

Importing required libraries 

#importing libraries
import numpy as np
import pandas as pd
import itertools
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

Note

NumPy and pandas are used here for data manipulation, and the itertools module provides tools for working with iterators.

Itertools: The Python itertools module is a collection of tools for handling iterators. Simply put, an iterable is any object that can be used in a for loop, and an iterator is the object that produces its items one at a time. The most common iterable in Python is the list.
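A short sketch of what this means in practice, using only standard-library calls. The sample list is made up for the illustration.

```python
import itertools

titles = ["first", "second", "third"]

# A list is iterable: iter() returns an iterator that yields its items one by one.
it = iter(titles)
first_item = next(it)

# itertools.islice takes the first items of any iterable without copying the rest.
first_two = list(itertools.islice(titles, 2))

# itertools.chain walks several iterables as if they were one sequence.
combined = list(itertools.chain(titles, ["fourth"]))

# itertools.product yields every pair, e.g. the cell positions of a 2x2 matrix.
cells = list(itertools.product(range(2), range(2)))

print(first_item, first_two, combined, cells)
```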

In the next step we read the dataset used in this project, which contains both fake and real news. You can download the dataset from here (click here).

#Reading the dataset
df=pd.read_csv('news.csv')
df.head()

Output

[Image: first rows of the dataset]

Note: As you can see in the output above, the dataset contains the title and text of each news item, along with its label (fake 0 or real 1).

#Defining our features and target
X=df['text']
Y=df['label']

#using split function
x_train,x_test,y_train,y_test=train_test_split(X,Y, test_size=0.2, random_state=7)

Note

Our goal is to determine whether a given text is fake or real. That means our target is the label (fake or real), which we predict from the feature, i.e. the text. We then split the data into training and test sets. The training data is used to fit the model, whereas the test data shows how much the model has learned. Here test_size=0.2 holds out 20% of the rows for testing, and random_state=7 makes the split reproducible.
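The idea behind the split can be sketched in plain Python. The helper simple_split below is hypothetical and written only for this illustration; it is not how train_test_split is implemented, and with the same seed it will not select the same rows as scikit-learn.

```python
import random


def simple_split(features, labels, test_size=0.2, seed=7):
    """Shuffle row indices with a fixed seed and hold out a fraction for testing."""
    indices = list(range(len(features)))
    random.Random(seed).shuffle(indices)  # same seed, same shuffle every run
    n_test = round(len(indices) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    # Features and labels are picked with the same indices so they stay aligned.
    x_train = [features[i] for i in train_idx]
    x_test = [features[i] for i in test_idx]
    y_train = [labels[i] for i in train_idx]
    y_test = [labels[i] for i in test_idx]
    return x_train, x_test, y_train, y_test


# Made-up toy data: ten texts with alternating labels.
texts = [f"news item {i}" for i in range(10)]
labels = [i % 2 for i in range(10)]

x_train, x_test, y_train, y_test = simple_split(texts, labels)
print(len(x_train), len(x_test))
```

Fixing the seed plays the same role as random_state: rerunning the code gives the same training and test rows, so accuracy numbers are comparable between runs.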

#preprocessing of data (tokenize and creating matrix)
tfidf_vectorizer=TfidfVectorizer(stop_words='english', max_df=0.7)

#preprocessing of train data
tfidf_train=tfidf_vectorizer.fit_transform(x_train)

#preprocessing of test data
tfidf_test=tfidf_vectorizer.transform(x_test)

Note

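As discussed earlier, TfidfVectorizer tokenizes the data, converts it into a matrix, and assigns a weight to each word; this is the preprocessing step. Because a machine learning model cannot work on raw text documents directly, the data must first be converted into a numeric matrix, as in the TfidfVectorizer example above. Here stop_words='english' removes common English words, and max_df=0.7 ignores words that appear in more than 70% of the documents. Note that the vectorizer is fitted on the training data only (fit_transform), and the test data is then converted with the same vocabulary (transform).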
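Why the training data goes through fit_transform while the test data only goes through transform can be shown with a tiny stand-in class. TinyVectorizer is a hypothetical word-count vectorizer written for this illustration; it is far simpler than TfidfVectorizer, but it follows the same fit/transform pattern.

```python
class TinyVectorizer:
    """Hypothetical count vectorizer illustrating fit_transform vs transform."""

    def fit(self, documents):
        # Learn the vocabulary (word -> column index) from these documents only.
        words = sorted({w for doc in documents for w in doc.lower().split()})
        self.vocabulary_ = {w: i for i, w in enumerate(words)}
        return self

    def transform(self, documents):
        # Count words using the vocabulary learned in fit().
        rows = []
        for doc in documents:
            row = [0] * len(self.vocabulary_)
            for w in doc.lower().split():
                if w in self.vocabulary_:  # words never seen in fit() are skipped
                    row[self.vocabulary_[w]] += 1
            rows.append(row)
        return rows

    def fit_transform(self, documents):
        return self.fit(documents).transform(documents)


vec = TinyVectorizer()
train_matrix = vec.fit_transform(["fake news spreads", "real news informs"])
test_matrix = vec.transform(["fake aliens spread news"])

print(sorted(vec.vocabulary_))
print(train_matrix)
print(test_matrix)
```

Because the test row is built from the training vocabulary, it has the same number of columns as the training matrix, which is what the classifier expects. Calling fit_transform on the test data instead would create a different vocabulary and a matrix the trained model cannot use.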

#classifier or algorithm to train the model
passive=PassiveAggressiveClassifier(max_iter=50)
passive.fit(tfidf_train,y_train)

#predict on the test data with the fitted classifier
y_pred=passive.predict(tfidf_test)

#accuracy of the model
score=accuracy_score(y_test,y_pred)
print(score)

#confusion matrix or kind of error calculation
confusion_matrix(y_test,y_pred)

Output

[Image: accuracy score and confusion matrix of the final model]

Note 

Here we used the PassiveAggressiveClassifier, a supervised learning algorithm, to train our model, and then tested it. The accuracy of our model is 0.92 (i.e. 92%). From the confusion matrix we can see that the model correctly classified 586 fake news items (0) and 586 real news items (1), and made 52 + 43 = 95 wrong predictions.
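As a quick check, the accuracy can be recomputed from the four confusion-matrix counts quoted above. The sketch assumes the usual layout with true labels in rows and predicted labels in columns; which off-diagonal cell holds 52 and which holds 43 is an assumption, and it does not affect the accuracy.

```python
# Rows are true labels, columns are predicted labels (0 = fake, 1 = real).
# The placement of 52 and 43 in the off-diagonal cells is assumed.
cm = [[586, 52],
      [43, 586]]

correct = cm[0][0] + cm[1][1]  # diagonal: predictions that match the true label
wrong = cm[0][1] + cm[1][0]    # off-diagonal: misclassified news items
total = correct + wrong
accuracy = correct / total

print(f"correct={correct} wrong={wrong} total={total} accuracy={accuracy:.3f}")
```

The total equals the size of the test set, and the ratio of correct predictions to that total is the same quantity accuracy_score reports.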
