
Word2Vec

Word2Vec is a group of models used to generate word embeddings, which are vector representations of words. These embeddings capture semantic relationships between words based on their context in a large corpus of text.

Explanation

Word2Vec, developed by Tomas Mikolov and his team at Google in 2013, is a pivotal technique in natural language processing for learning word embeddings. It rests on the distributional hypothesis: words that appear in similar contexts tend to have similar meanings.

Word2Vec offers two primary architectures: Continuous Bag-of-Words (CBOW) and Skip-gram. CBOW predicts a target word from its surrounding context words, while Skip-gram does the opposite, predicting the surrounding context words given a target word. Training feeds a large corpus of text through a shallow neural network and adjusts its weights to optimize the prediction task; the learned weights associated with the input words become the word embeddings.

The resulting embeddings are dense, low-dimensional vectors (typically 100-300 dimensions) that represent words in a continuous vector space. Crucially, the geometric relationships between these vectors reflect semantic relationships between the corresponding words: the vectors for 'king' and 'queen' lie closer together than the vectors for 'king' and 'table', and vector arithmetic can capture analogies, with vec('king') - vec('man') + vec('woman') landing near vec('queen'). Word2Vec embeddings are used as input features for many downstream NLP tasks, such as text classification, machine translation, and sentiment analysis, significantly improving their performance by providing rich semantic information about words.
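The Skip-gram training loop described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: the toy corpus, dimension size, learning rate, and epoch count are arbitrary choices, and it uses a full softmax rather than the negative sampling or hierarchical softmax used in practice for large vocabularies.

```python
import numpy as np

# Toy corpus and vocabulary (hypothetical example data)
corpus = "the king rules the land and the queen rules the land".split()
vocab = sorted(set(corpus))
w2i = {w: i for i, w in enumerate(vocab)}
V, D, window, lr = len(vocab), 8, 2, 0.05

rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(V, D))   # input weights -> the embeddings we keep
W_out = rng.normal(scale=0.1, size=(V, D))  # output (context) weights

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Skip-gram training pairs: (target, context) within the window
pairs = [(w2i[corpus[i]], w2i[corpus[j]])
         for i in range(len(corpus))
         for j in range(max(0, i - window), min(len(corpus), i + window + 1))
         if i != j]

for epoch in range(100):
    for t, c in pairs:
        v = W_in[t]                 # target word's current embedding
        p = softmax(W_out @ v)      # predicted distribution over context words
        p[c] -= 1.0                 # gradient of cross-entropy w.r.t. the scores
        grad_out = np.outer(p, v)   # gradient for the output matrix
        grad_in = W_out.T @ p       # gradient for the target embedding
        W_out -= lr * grad_out
        W_in[t] -= lr * grad_in

# After training, rows of W_in are the word embeddings
king = W_in[w2i["king"]]
print(king.shape)  # (8,)
```

Each row of `W_in` is an embedding; similarity between words is then typically measured with cosine similarity between their rows.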

Related Terms