Implementation of Word2Vec. | Word2Vec | Text-preprocessing | NLP

Question


asked Apr 23 in Artificial Intelligence(AI) & Machine Learning by Sharda Chaudhary Goeduhub's Expert (2.1k points)
edited Apr 23 by Sharda Chaudhary

In this article, we will learn what word2vec is and why it matters in text preprocessing: it converts text into vectors that machine learning and deep learning applications can work with. Finally, we will implement word2vec in Python.


2 Answers

answered Apr 23 by Sharda Chaudhary Goeduhub's Expert (2.1k points)
edited Apr 23 by Sharda Chaudhary

Best answer

Prerequisite- Word2Vec Theory , NLP

Documentation- GENSIM

GENSIM- Gensim is an open-source Python library that implements various models and algorithms, including Word2Vec.

In this tutorial, we will apply word2vec embedding (a family of algorithms) to a corpus. Corpus: history of India, from Wikipedia.

Importing Libraries and models

#importing some important libraries

import nltk

#importing word2vec

from gensim.models import Word2Vec

from nltk.corpus import stopwords

import re

Note

You are probably already familiar with the libraries imported above. Let's take a brief look at each of them.

NLTK: NLTK (Natural Language Toolkit) is a widely used NLP library for preprocessing text data (human language) into a format that a machine can process further.

RegEx (re): A regular expression (or RE) specifies a set of strings that matches it; the functions in this module let you check if a particular string matches a given regular expression (or if a given regular expression matches a particular string, which comes down to the same thing).
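As a small illustration of the matching functions described above, here is a minimal sketch using only the standard-library re module. The sample sentence is shortened from the kind of text used later in this tutorial, and the variable names are chosen only for this example.

```python
import re

# Sample sentence used only for this illustration.
sample = "Settled life began in South Asia around 7,000 BCE."

# re.search finds the first place where the pattern matches.
first_number = re.search(r"\d+", sample)

# re.findall returns every non-overlapping match as a list of strings.
all_numbers = re.findall(r"\d+", sample)

# re.fullmatch succeeds only if the whole string matches the pattern.
is_word = re.fullmatch(r"[a-z]+", "jainism") is not None
is_word_with_digit = re.fullmatch(r"[a-z]+", "jainism2") is not None

print(first_number.group())
print(all_numbers)
print(is_word, is_word_with_digit)
```

Note that the comma in "7,000" splits the number into two matches, which is one reason the preprocessing later removes digits altogether.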

Word2Vec: Converts a word into a vector (embedding) that a machine can use for various purposes (translation, image captioning, etc.).

Stopwords: stopwords is an NLTK corpus of very common words that carry little meaning on their own but help a sentence read grammatically (for example: the, are, of, it, from). We use it to filter such words out of the text.
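To make the idea of stopword filtering concrete, here is a minimal sketch in plain Python. The STOPWORDS set and the remove_stopwords helper are hypothetical and exist only for this example; the hand-written set stands in for NLTK's much longer English list.

```python
# A tiny hand-written stopword set, used here only for illustration.
# It is an assumption that stands in for NLTK's English stopword list.
STOPWORDS = {"the", "are", "of", "it", "from", "on", "in", "to", "and"}

def remove_stopwords(tokens):
    """Return the tokens that are not in STOPWORDS, preserving order."""
    return [token for token in tokens if token not in STOPWORDS]

tokens = "modern humans first arrived on the indian subcontinent from africa".split()
filtered = remove_stopwords(tokens)
print(filtered)
```

Keeping the stopwords in a set makes each membership check fast, which matters once the corpus grows.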

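To use the "punkt" and "stopwords" packages of NLTK, we first have to download them:

nltk.download('punkt')
nltk.download('stopwords')

Note: If you skip this step, NLTK raises a LookupError the first time the tokenizer or the stopword list is used. Recent NLTK releases may ask for 'punkt_tab' instead of 'punkt'; the error message names the resource to download.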

#loading data (corpus)

corpus = """According to consensus in modern genetics anatomically

modern humans first arrived on the Indian subcontinent from Africa

between 73,000 and 55,000 years ago. However, the earliest known

human remains in South Asia date to 30,000 years ago.

Settled life, which involves the transition from foraging to farming

and pastoralism, began in South Asia around 7,000 BCE.

At the site of Mehrgarh, Balochistan, Pakistan, presence can be

documented of the domestication of wheat and barley, rapidly

followed by that of goats, sheep, and cattle.

By 4,500 BCE, settled life had spread more widely, and began to

gradually evolve into the Indus Valley Civilization, an early

civilization of the Old world, which was contemporaneous with

Ancient Egypt and Mesopotamia.

This civilisation flourished between 2,500 BCE and 1900 BCE in what

today is Pakistan and north-western India, and was noted for its

urban planning, baked brick houses, elaborate drainage, and water

supply.

In early second millennium BCE persistent drought caused the

population of the Indus Valley to scatter from large urban centres

to villages.

Around the same time, Indo-Aryan tribes moved into the Punjab from

regions further northwest in several waves of migration.

The resulting Vedic period was marked by the composition of the

Vedas, large collections of hymns of these tribes whose postulated

religious culture, through synthesis with the preexisting religious

cultures of the subcontinent, gave rise to Hinduism.

The caste system, which created a hierarchy of priests, warriors,

and free peasants, but which excluded indigenous peoples by labeling

their occupations impure, arose later during this period.

Towards the end of the period, around 600 BCE, after the pastoral

and nomadic Indo-Aryans spread from the Punjab into the Gangetic

plain, large swaths of which they deforested to pave way for

agriculture, a second urbanisation took place.

The small Indo-Aryan chieftaincies, or janapadas, were

consolidated into larger states, or mahajanapadas.

This urbanisation was accompanied by the rise of new ascetic

movements in Greater Magadha, including Jainism and Buddhism,

which opposed the growing influence of Brahmanism and the primacy

of rituals, presided by Brahmin priests, that had come to be

associated with Vedic religion, and gave rise to new religious

concepts."""

Note: In this block we just loaded our data set (history of India, from Wikipedia).

Sharda Chaudhary · Answer 1 · 2021-04-23T11:08:51+0000

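# Preprocessing of raw text
text = re.sub(r'\[[0-9]*\]', ' ', corpus)  # remove citation markers such as [1]
text = re.sub(r'\s+', ' ', text)           # collapse whitespace and line breaks
text = text.lower()                        # lowercase everything
text = re.sub(r'\d', ' ', text)            # remove digits
text = re.sub(r'\s+', ' ', text)           # collapse whitespace again

# Sentence tokenizing
sentences = nltk.sent_tokenize(text)

# Word tokenizing
sentences = [nltk.word_tokenize(sentence) for sentence in sentences]

# Removing stopwords (build the set once instead of reloading the list for every word)
stop_words = set(stopwords.words('english'))
for i in range(len(sentences)):
    sentences[i] = [word for word in sentences[i] if word not in stop_words]

# Printing a sentence (after the loop, i is the index of the last sentence)
print(sentences[i])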

Output (image): raw_text

Note

The output above shows the preprocessed data. In the block above, we first removed bracketed citation markers, digits and extra whitespace from the text. Punctuation such as commas is not removed by these steps, so it stays in the data as separate tokens.

We also converted all characters to lowercase to avoid an unnecessarily large vocabulary.

For example, "India" and "india" are the same word, but without lowercasing the vocabulary would treat them as two different entries, which we do not want.

After this cleanup, we tokenized the text into sentences, tokenized each sentence into words, and removed the stopwords.
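The effect of these cleaning steps on vocabulary size can be sketched with the standard library alone. The preprocess helper below is a hypothetical wrapper around the same regular expressions, and the raw string is made up for illustration; a whitespace split is used in place of NLTK's tokenizers to keep the example self-contained.

```python
import re

def preprocess(raw):
    """Apply the same cleaning steps as the regex calls in this tutorial."""
    text = re.sub(r"\[[0-9]*\]", " ", raw)   # drop citation markers like [3]
    text = text.lower()                       # merge "India" and "india"
    text = re.sub(r"\d", " ", text)           # drop digits
    text = re.sub(r"\s+", " ", text)          # collapse whitespace
    return text.strip()

# Made-up input, not part of the tutorial's corpus.
raw = "India[1] is in South Asia. india has 28 states."

vocab_before = set(raw.split())
vocab_after = set(preprocess(raw).split())

print(preprocess(raw))
print(len(vocab_before), len(vocab_after))
```

Here "India[1]" and "india" collapse into one entry and the number disappears, so the vocabulary shrinks from 9 entries to 7.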

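# Training the Word2Vec model
model = Word2Vec(sentences, min_count=1)

# Printing the vocabulary
# (gensim 4.x exposes it as model.wv.key_to_index; on gensim 3.x use model.wv.vocab)
vocab = model.wv.key_to_index
print(vocab)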

Output (image): vocab

Note

Here, we passed our preprocessed data to Word2Vec to convert the words into vectors. After training, we printed the vocabulary recognized by Word2Vec (underlined in blue in the output: consensus, modern and more when you run the program).

These are the unique words that Word2Vec found in the text.

min_count (int, optional) – Ignores all words with total frequency lower than this. The default is 5; we set min_count=1 so that every word is kept, which suits a corpus this small.
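The min_count rule can be imitated in plain Python to see which words would survive a given threshold. This is only a sketch of the documented behaviour, not gensim's internal code; the surviving_words helper and the toy sentences are made up for this example.

```python
from collections import Counter

def surviving_words(sentences, min_count):
    """Return the words whose total frequency is at least min_count."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word for word, count in counts.items() if count >= min_count}

# Toy tokenized sentences, not the tutorial's corpus.
toy_sentences = [
    ["indus", "valley", "civilization"],
    ["indus", "valley", "drought"],
    ["vedic", "period"],
]

keep_all = sorted(surviving_words(toy_sentences, min_count=1))
keep_frequent = sorted(surviving_words(toy_sentences, min_count=2))
print(keep_all)
print(keep_frequent)
```

With min_count=2 only "indus" and "valley" remain, which shows why a high threshold on a tiny corpus would leave almost nothing to train on.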

To see the vector of a word, run the code below. For example, to see the vector of the word (vocab) "jainism":

# Printing the vector of a word (vocab)
vector = model.wv['jainism']
print(vector)

Output (image): vocab_vec

Note: As you can see, the vector generated by Word2Vec is dense and has a fixed dimension. This is the property I mentioned in the theory article: it addresses the sparse-matrix and overfitting problems of TF-IDF and BOW (bag of words).
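The contrast between a sparse bag-of-words vector and a dense embedding can be sketched in plain Python. The vocabulary, the bag_of_words helper and the numbers in dense_example are all made up for illustration; real Word2Vec vectors are learned from data, and gensim's default vector size is 100.

```python
def bag_of_words(tokens, vocabulary):
    """Count vector with one position per vocabulary word."""
    return [tokens.count(word) for word in vocabulary]

# Toy vocabulary and sentence, chosen only for this example.
vocabulary = ["brahmanism", "buddhism", "caste", "hinduism", "indus",
              "jainism", "magadha", "punjab", "vedas", "vedic"]
sentence = ["jainism", "buddhism", "magadha"]

bow = bag_of_words(sentence, vocabulary)
zero_share = bow.count(0) / len(bow)

# A made-up 4-dimensional dense vector, for shape comparison only.
dense_example = [0.12, -0.48, 0.05, 0.33]

print(bow, zero_share)
print(len(bow), len(dense_example))
```

The bag-of-words vector grows with the vocabulary and is mostly zeros, while the dense vector keeps the same length no matter how many words the corpus contains.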

# Similar words of a word

similar_words = model.wv.most_similar('jainism')

print(similar_words)

Output (image): vocab_simlar

Note

Word2Vec captures relations between words. In this block of code we printed the words most similar to "jainism". They are: religion (Jainism is a religion), bce (Before Common Era; Jainism originated before the Common Era), hymns (old songs in praise of God). Keep in mind that the corpus here is only a few paragraphs long, so these neighbours can change from run to run and should not be read as reliable semantic relations.
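Gensim's most_similar ranks words by the cosine similarity of their vectors. The sketch below imitates that ranking in plain Python; the cosine_similarity and most_similar helpers and the two-dimensional toy_vectors are made up for this example and are not gensim's implementation or real embeddings.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(word, vectors, topn=2):
    """Rank every other word by cosine similarity to the query word."""
    query = vectors[word]
    scores = [(other, cosine_similarity(query, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Hand-picked 2-dimensional vectors, for illustration only.
toy_vectors = {
    "jainism": [1.0, 0.0],
    "buddhism": [0.9, 0.1],
    "religion": [0.5, 0.5],
    "wheat": [0.0, 1.0],
}

neighbours = most_similar("jainism", toy_vectors)
print(neighbours)
```

Words whose vectors point in nearly the same direction score close to 1, while unrelated directions score close to 0.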
