To start with word2vec, we first need to understand what matters most in text pre-processing, whether the task is language translation, text generation, or any other natural language task.
The first thing that comes to mind whenever we work with text or natural language is text pre-processing, in other words converting the text into numerical form, a vector, so that it can be used further.
We humans can think in language directly, but a machine only understands numbers, not characters. So how well a machine handles natural language data depends on how we convert characters and words into vectors, and researchers keep trying to develop better techniques and to improve the existing ones.
We already discussed word embedding, the numerical representation of words, and the different word embedding techniques:
Word Embedding: converting text into a vector (a numerical representation of the text)
CountVectorizer, TF-IDF, one-hot encoding, BOW (bag of words), ...
What is the problem with these techniques?
In TF-IDF (term frequency-inverse document frequency) and BOW, we convert a text into a vector in which every value is zero except at the index of each word that occurs, where we store a count (or a weighted frequency) of that word.
The result is a sparse matrix with lots of zeros that contains no semantic information about the text, and all these large sparse vectors can also cause overfitting when the amount of data is huge.
Semantic information is the logical structure of sentences that identifies the most relevant elements of a text so that we can understand it, for example the grammar and the order of words in an English sentence.
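As a rough illustration of this problem, here is a minimal sketch using scikit-learn's CountVectorizer (the library choice is just an assumption; any bag-of-words implementation behaves the same way). The vectors grow with the vocabulary, and the columns for "kid" and "child" carry no hint that the two words are related; with a realistic vocabulary, almost every entry is zero.

```python
# Minimal bag-of-words sketch, assuming scikit-learn is installed.
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "the kid plays cricket in the street",
    "the child plays cricket in the street",
]

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(corpus)      # sparse document-term matrix

print(vectorizer.get_feature_names_out())   # one column per vocabulary word
print(bow.toarray())                        # with a realistic vocabulary of tens of
                                            # thousands of words, almost every entry
                                            # in these rows would be zero
```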
Word2Vec
In this model, each word in a text is represented as a vector of a fixed dimension, rather than a dimension that depends on the amount of data or the size of the vocabulary.
Word2vec is not a single algorithm but a combination of model architectures and optimizations that can be used to learn word embeddings from large datasets, embeddings that preserve semantic information and the relations between different words in a text.
For example, king - man + woman ≈ queen; the relations between words that word2vec learns from huge amounts of data are truly magical.
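The analogy can be checked with pretrained vectors. Below is a hedged sketch using gensim's downloader (an assumption on my part; note that the pretrained Google News model is a large download of roughly 1.6 GB).

```python
# Sketch of the king - man + woman ~ queen analogy with pretrained word2vec
# vectors, assuming gensim is installed (the model download is ~1.6 GB).
import gensim.downloader as api

vectors = api.load("word2vec-google-news-300")   # pretrained KeyedVectors

# Vector arithmetic: king - man + woman; "queen" is expected near the top.
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```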
How does this happen? Word2vec uses a neural network to generate the word embeddings, trained so that words appearing in similar contexts end up with similar embeddings.
For example: The kid plays cricket in the street
The child plays cricket in the street
In this case, "child" and "kid" get similar vectors because they appear in similar contexts.
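As a small, hedged sketch, the snippet below trains a gensim Word2Vec model on just these two sentences (on such a tiny corpus the numbers are not meaningful, but the workflow is the same as on a large corpus).

```python
# Toy Word2Vec training run, assuming gensim (4.x) is installed.
from gensim.models import Word2Vec

sentences = [
    ["the", "kid", "plays", "cricket", "in", "the", "street"],
    ["the", "child", "plays", "cricket", "in", "the", "street"],
]

# vector_size is the fixed embedding dimension; window is the context size.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=200)

# Cosine similarity of the two learned vectors; with enough real data,
# words used in the same contexts score close to 1.
print(model.wv.similarity("kid", "child"))
```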
Word2vec comprises two techniques (algorithms): CBOW (continuous bag of words) and the skip-gram model.
CBOW (Continuous Bag of Words)
The CBOW algorithm predicts the middle word based on the surrounding context words.
For example
The quick brown fox ___ over the lazy dog -- predict --> jumps
The order of the words in the context is not that important; what matters here is which words appear.
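To make the context/target split concrete, here is a plain-Python sketch (no library assumed) that lists the CBOW training pairs for this sentence with a context window of 2.

```python
# List (context words -> middle word) pairs for CBOW, window size 2.
sentence = "the quick brown fox jumps over the lazy dog".split()
window = 2

for i, target in enumerate(sentence):
    context = sentence[max(0, i - window):i] + sentence[i + 1:i + 1 + window]
    print(context, "-> predict ->", target)

# e.g. ['brown', 'fox', 'over', 'the'] -> predict -> jumps
```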
Working of CBOW
The working of CBOW starts with one-hot encoding of each word in the sentence.
For example, take the sentence "Python training Goeduhub Technologies".
First, we create a one-hot vector for each word in the sentence.
After that, we consider the first three words, "Python training Goeduhub", and try to predict the middle word from its context, that is, from "Python" and "Goeduhub".
We get a predicted output for the word "training", compare this output with the actual output (the one-hot vector of "training"), and update the weights until the prediction is a good match.
We repeat this process over a continuously sliding window of words, which is where the name continuous bag of words (CBOW) comes from. For the above example, the first bag is "Python training Goeduhub", the second is "training Goeduhub Technologies", and so on, as shown in the sketch below.
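Here is a small, hedged sketch (using NumPy, an assumption) of the one-hot step and the sliding windows for this example sentence.

```python
# One-hot vectors and CBOW windows for "Python training Goeduhub Technologies".
import numpy as np

words = "Python training Goeduhub Technologies".split()
index = {w: i for i, w in enumerate(words)}   # word -> position in the vocabulary

def one_hot(word):
    vec = np.zeros(len(index))
    vec[index[word]] = 1.0
    return vec

print(one_hot("training"))                    # [0. 1. 0. 0.]

# Windows of three words: the outer words are the context, the middle word
# is the target the network has to predict.
for i in range(1, len(words) - 1):
    context = [words[i - 1], words[i + 1]]
    print(context, "-> predict ->", words[i])
# ['Python', 'Goeduhub'] -> predict -> training
# ['training', 'Technologies'] -> predict -> Goeduhub
```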
Skip-Gram
The skip-gram algorithm does the opposite of CBOW: it predicts the surrounding context words based on the middle (center) word.
For example,
jumps -- predict --> The quick brown fox ___ over the lazy dog.
In our earlier example, this means predicting the context words "Python" and "Goeduhub" based on the center word "training".
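The same sliding-window idea can be written the skip-gram way: each center word is paired with every word in its context. A plain-Python sketch:

```python
# (center word -> context word) pairs for skip-gram, window size 1.
words = "Python training Goeduhub Technologies".split()
window = 1

for i, center in enumerate(words):
    for j in range(max(0, i - window), min(len(words), i + window + 1)):
        if j != i:
            print(center, "-> predict ->", words[j])
# training -> predict -> Python
# training -> predict -> Goeduhub
# ...
```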
In the end, the softmax function is used to maximize the probability of predicting the context words. For a sequence of training words $w_1, w_2, \dots, w_T$, the objective can be written as the average log probability

$$\frac{1}{T}\sum_{t=1}^{T} \sum_{-c \le j \le c,\; j \ne 0} \log p(w_{t+j} \mid w_t)$$

where c is the size of the training context. The basic skip-gram formulation defines this probability using the softmax function

$$p(w_O \mid w_I) = \frac{\exp\left({v'_{w_O}}^{\top} v_{w_I}\right)}{\sum_{w=1}^{W} \exp\left({v'_{w}}^{\top} v_{w_I}\right)}$$

where v and v' are the target (input) and context (output) vector representations of the words and W is the vocabulary size.
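As a toy numerical check of this softmax (all numbers below are made up for illustration), one can compute p(w_O | w_I) from random vectors:

```python
# Toy check of the skip-gram softmax with random vectors (NumPy assumed).
import numpy as np

rng = np.random.default_rng(0)
W, dim = 5, 8                        # vocabulary size and embedding dimension
v = rng.normal(size=(W, dim))        # target (input) vectors
v_prime = rng.normal(size=(W, dim))  # context (output) vectors

w_I, w_O = 2, 4                      # indices of the input and context word
scores = v_prime @ v[w_I]            # dot product of every context vector with v_{w_I}
p = np.exp(scores[w_O]) / np.exp(scores).sum()
print(p)                             # p(w_O | w_I) under the softmax
```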
Note: Word2Vec uses these two algorithms to produce a fixed-dimension numerical representation of words that captures their context and the relations between words.