To start with word2vec, we first need to understand what matters most in text pre-processing, whether the task is language translation, text generation, or any other natural language task.
The first thing that comes to mind whenever we work with text or natural language is text pre-processing, in other words converting the text into numerical form, a vector, so that it can be used further.
We humans can think in language directly, but a machine only understands numbers, not characters. So how well a machine handles natural language data depends on how we convert characters and words into vectors, and researchers keep trying to develop better techniques and to improve the existing ones.
We already discussed word embedding, the numerical representation of words, and the different word embedding techniques:
Word Embedding: converting text into a vector (a numerical representation of the text)
CountVectorizer, TF-IDF, one-hot encoding, BOW (bag of words), ...
What is the problem with these techniques?
In TF-IDF (term frequency-inverse document frequency) and BOW, we convert a text into a vector in which every value is zero except at the index of each word that occurs, where we store a count (or a weighted frequency) of that word.
The result is a sparse matrix with lots of zeros that contains no semantic information about the text, and all these large sparse vectors can also cause overfitting when the amount of data is huge.
Semantic information is the logical structure of sentences that identifies the most relevant elements of a text so that we can understand it, for example the grammar and the order of words in an English sentence.
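As a rough illustration of this problem, here is a minimal sketch using scikit-learn's CountVectorizer (the library choice is just an assumption; any bag-of-words implementation behaves the same way). The vectors grow with the vocabulary, and the columns for "kid" and "child" carry no hint that the two words are related; with a realistic vocabulary, almost every entry is zero.

```python
# Minimal bag-of-words sketch, assuming scikit-learn is installed.
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "the kid plays cricket in the street",
    "the child plays cricket in the street",
]

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(corpus)      # sparse document-term matrix

print(vectorizer.get_feature_names_out())   # one column per vocabulary word
print(bow.toarray())                        # with a realistic vocabulary of tens of
                                            # thousands of words, almost every entry
                                            # in these rows would be zero
```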
Word2Vec
In this model, each word in a text is represented as a vector of a fixed dimension, rather than a dimension that depends on the amount of data or the size of the vocabulary.
Word2vec is not a single algorithm but a combination of model architectures and optimizations that can be used to learn word embeddings from large datasets, embeddings that preserve semantic information and the relations between different words in a text.
For example, king - man + woman ≈ queen; the relations between words that word2vec learns from huge amounts of data are truly magical.
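The analogy can be checked with pretrained vectors. Below is a hedged sketch using gensim's downloader (an assumption on my part; note that the pretrained Google News model is a large download of roughly 1.6 GB).

```python
# Sketch of the king - man + woman ~ queen analogy with pretrained word2vec
# vectors, assuming gensim is installed (the model download is ~1.6 GB).
import gensim.downloader as api

vectors = api.load("word2vec-google-news-300")   # pretrained KeyedVectors

# Vector arithmetic: king - man + woman; "queen" is expected near the top.
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```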
How does this happen? Word2vec uses a neural network to generate the word embeddings, trained so that words appearing in similar contexts end up with similar embeddings.
For example: The kid plays cricket in the street
The child plays cricket in the street
In this case, "child" and "kid" get similar vectors because they appear in similar contexts.
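As a small, hedged sketch, the snippet below trains a gensim Word2Vec model on just these two sentences (on such a tiny corpus the numbers are not meaningful, but the workflow is the same as on a large corpus).

```python
# Toy Word2Vec training run, assuming gensim (4.x) is installed.
from gensim.models import Word2Vec

sentences = [
    ["the", "kid", "plays", "cricket", "in", "the", "street"],
    ["the", "child", "plays", "cricket", "in", "the", "street"],
]

# vector_size is the fixed embedding dimension; window is the context size.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=200)

# Cosine similarity of the two learned vectors; with enough real data,
# words used in the same contexts score close to 1.
print(model.wv.similarity("kid", "child"))
```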
Word2vec comprises two techniques (algorithms): CBOW (continuous bag of words) and the skip-gram model.
CBOW (Continuous Bag of Words)
The CBOW algorithm predicts the middle word based on the surrounding context words.
For example
The quick brown fox ___ over the lazy dog -- predict --> jumps
The order of the words in the context is not that important; what matters here is which words appear.
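To make the context/target split concrete, here is a plain-Python sketch (no library assumed) that lists the CBOW training pairs for this sentence with a context window of 2.

```python
# List (context words -> middle word) pairs for CBOW, window size 2.
sentence = "the quick brown fox jumps over the lazy dog".split()
window = 2

for i, target in enumerate(sentence):
    context = sentence[max(0, i - window):i] + sentence[i + 1:i + 1 + window]
    print(context, "-> predict ->", target)

# e.g. ['brown', 'fox', 'over', 'the'] -> predict -> jumps
```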
Working of CBOW
The working of CBOW starts with one-hot encoding of each word in the sentence.
For example, take the sentence "Python training Goeduhub Technologies".
First, we create a one-hot vector for each word in the sentence.
After that, we consider the first three words, "Python training Goeduhub", and try to predict the middle word from its context, that is, from "Python" and "Goeduhub".
We get a predicted output for the word "training", compare this output with the actual output (the one-hot vector of "training"), and update the weights until the prediction is a good match.
We repeat this process over a continuously sliding window of words, which is where the name continuous bag of words (CBOW) comes from. For the above example, the first bag is "Python training Goeduhub", the second is "training Goeduhub Technologies", and so on, as shown in the sketch below.
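Here is a small, hedged sketch (using NumPy, an assumption) of the one-hot step and the sliding windows for this example sentence.

```python
# One-hot vectors and CBOW windows for "Python training Goeduhub Technologies".
import numpy as np

words = "Python training Goeduhub Technologies".split()
index = {w: i for i, w in enumerate(words)}   # word -> position in the vocabulary

def one_hot(word):
    vec = np.zeros(len(index))
    vec[index[word]] = 1.0
    return vec

print(one_hot("training"))                    # [0. 1. 0. 0.]

# Windows of three words: the outer words are the context, the middle word
# is the target the network has to predict.
for i in range(1, len(words) - 1):
    context = [words[i - 1], words[i + 1]]
    print(context, "-> predict ->", words[i])
# ['Python', 'Goeduhub'] -> predict -> training
# ['training', 'Technologies'] -> predict -> Goeduhub
```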
Skip-Gram
The skip-gram algorithm does the opposite of CBOW: it predicts the surrounding context words based on the middle (center) word.
For example,
jumps -- predict --> The quick brown fox ___ over the lazy dog.
In our earlier example, this means predicting the context words "Python" and "Goeduhub" based on the center word "training".
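The same sliding-window idea can be written the skip-gram way: each center word is paired with every word in its context. A plain-Python sketch:

```python
# (center word -> context word) pairs for skip-gram, window size 1.
words = "Python training Goeduhub Technologies".split()
window = 1

for i, center in enumerate(words):
    for j in range(max(0, i - window), min(len(words), i + window + 1)):
        if j != i:
            print(center, "-> predict ->", words[j])
# training -> predict -> Python
# training -> predict -> Goeduhub
# ...
```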
In the end, the softmax function is used to maximize the probability of predicting the context words. For a sequence of training words $w_1, w_2, \dots, w_T$, the objective can be written as the average log probability

$$\frac{1}{T}\sum_{t=1}^{T} \sum_{-c \le j \le c,\; j \ne 0} \log p(w_{t+j} \mid w_t)$$

where c is the size of the training context. The basic skip-gram formulation defines this probability using the softmax function

$$p(w_O \mid w_I) = \frac{\exp\left({v'_{w_O}}^{\top} v_{w_I}\right)}{\sum_{w=1}^{W} \exp\left({v'_{w}}^{\top} v_{w_I}\right)}$$

where v and v' are the target (input) and context (output) vector representations of the words and W is the vocabulary size.
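As a toy numerical check of this softmax (all numbers below are made up for illustration), one can compute p(w_O | w_I) from random vectors:

```python
# Toy check of the skip-gram softmax with random vectors (NumPy assumed).
import numpy as np

rng = np.random.default_rng(0)
W, dim = 5, 8                        # vocabulary size and embedding dimension
v = rng.normal(size=(W, dim))        # target (input) vectors
v_prime = rng.normal(size=(W, dim))  # context (output) vectors

w_I, w_O = 2, 4                      # indices of the input and context word
scores = v_prime @ v[w_I]            # dot product of every context vector with v_{w_I}
p = np.exp(scores[w_O]) / np.exp(scores).sum()
print(p)                             # p(w_O | w_I) under the softmax
```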
Note: Word2Vec uses these two algorithms to produce a fixed-dimension numerical representation of words that captures their context and the relations between words.