

23/07/2024

Word embeddings: An easy-to-understand guide

LinkedIn, blog posts and social media are full of content describing how word embeddings are the basis for GenAI – the cornerstone of all things AI. If you speak to a machine learning engineer, a data scientist or a mathematician, you are likely to find the concept of “word embeddings” behind much of the NLP science and Generative AI that has surrounded us since late 2022, and behind the ripples of change it has sent through the world as we knew it.

When, back in 2021, we predicted that “AI will read text to discover information for you”, that assertion was based on our understanding of how word embeddings worked and on the first experiments with them.

Word embeddings are numerical representations of words in a high-dimensional vector space. They capture semantic relationships between words based on their usage patterns in large text corpora. But not everybody has the engineering or mathematical background needed to understand what that actually means.

As a translator, you understand that words have complex meanings and relationships. Word embeddings are a way to represent these complexities mathematically, which helps computers process and understand language more like humans do. 

 

Imagine a vast, multidimensional space where each word in a language is represented by a unique point. This point is defined by a list of numbers (a vector). Words with similar meanings or usage patterns end up closer together in this space. 

For example, "dog" and "cat" might be relatively close in this space because they're both common pets. "Feline" would be very close to "cat," while "automobile" would be far from both. 

These representations are created by analyzing huge amounts of text (corpora) to see how words are used in context. If two words often appear in similar contexts, the computer assumes they're related and positions them closer in this mathematical space. 
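
To make the idea concrete, here is a minimal sketch using tiny made-up vectors and cosine similarity as the measure of closeness; real embeddings are learned from data and have hundreds of dimensions, so the numbers below are purely illustrative.

```python
# Toy illustration of "closeness" in embedding space. The 4-dimensional
# vectors are invented for this example, not learned from a corpus.
import numpy as np

toy_embeddings = {
    "dog":        np.array([0.8, 0.6, 0.1, 0.0]),
    "cat":        np.array([0.7, 0.7, 0.2, 0.0]),
    "feline":     np.array([0.6, 0.8, 0.2, 0.1]),
    "automobile": np.array([0.0, 0.1, 0.9, 0.8]),
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: closer to 1.0 = more similar."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(toy_embeddings["cat"], toy_embeddings["feline"]))      # high
print(cosine_similarity(toy_embeddings["cat"], toy_embeddings["automobile"]))  # low
```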

For a translator, this concept is valuable because: 

  1. It helps capture nuances in meaning that might exist between languages. 
  2. It can suggest synonyms or related words that might be useful in translation. 
  3. It forms the basis for many modern machine translation systems. 

Understanding word embeddings can give you insight into how machine translation tools work and why they make certain choices in translations.

These are some basic points for further reading about word embeddings: 

  1. Vector representation: Each word is represented as a dense vector of real numbers. 
  2. Semantic similarity: Words with similar meanings are positioned closer together in the vector space. 
  3. Dimensionality: Embeddings typically range from 50 to 300 dimensions, allowing for a rich representation of word relationships. 
  4. Uses: Common in natural language processing tasks like machine translation, sentiment analysis, and text classification. 
  5. Training methods: Can be created using techniques like Word2Vec, GloVe, or FastText. 
  6. Analogies: Can capture semantic relationships, allowing for operations like "king - man + woman = queen". 
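
If you want to try these points yourself, the sketch below uses pre-trained GloVe vectors distributed through gensim’s downloader. The model name and the download step are assumptions about your setup; any comparable set of pre-trained vectors would behave similarly.

```python
# Nearest neighbours and the classic "king - man + woman" analogy, using
# pre-trained 50-dimensional GloVe vectors (downloaded on first use).
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")   # returns a KeyedVectors object

# Semantic similarity: words used in similar contexts end up close together.
print(vectors.most_similar("cat", topn=3))

# Vector arithmetic: king - man + woman lands near "queen".
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```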

How are word embeddings trained? 

Word embeddings are typically trained on large corpora of text. The underlying principle is that words appearing in similar contexts tend to have similar meanings.  

There are two popular training methods: 

  1. Continuous Bag of Words (CBOW): Predicts a target word based on its context words. 
  2. Skip-gram: Predicts context words given a target word. 
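
As a rough illustration, gensim’s Word2Vec implementation exposes both variants through a single `sg` flag. The toy corpus and hyperparameters below are assumptions made for demonstration only; real embeddings need corpora with millions of sentences.

```python
# Training CBOW (sg=0) and skip-gram (sg=1) models on a toy corpus with gensim.
from gensim.models import Word2Vec

corpus = [
    ["the", "translator", "reviewed", "the", "machine", "translation"],
    ["the", "linguist", "reviewed", "the", "terminology"],
    ["word", "embeddings", "represent", "words", "as", "vectors"],
]

cbow_model = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=0)
skipgram_model = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=1)

print(cbow_model.wv["translator"][:5])                       # first 5 dimensions
print(skipgram_model.wv.most_similar("translator", topn=2))
```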

Properties and capabilities of word embeddings  

  • Compositionality: Word vectors can be combined (e.g., by averaging) to represent phrases or sentences. This makes it possible to go beyond the word level to the sentence or paragraph level and thus convey a whole message (see the sketch after this list). 
  • Cross-lingual embeddings: Word embeddings can map words from different languages into a shared vector space, which is very interesting for machine translation and for transferring knowledge from one language to another (think of the concepts of “car” and “coche” in European Spanish or “carro” in Latin American Spanish, “automobile” and “automóvil”, “means of transport” and “medio de transporte”, and so on).
  • Handling out-of-vocabulary words: Some models, like FastText, can generate embeddings for unseen words based on subword information, which is very handy when you face newly coined words like “fitfluencer”.
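
The sketch below illustrates two of these points with gensim’s FastText implementation: averaging word vectors to represent a phrase, and building a vector for a word the model has never seen. The toy corpus and hyperparameters are assumptions made purely for illustration.

```python
# Compositionality and out-of-vocabulary handling with FastText (gensim 4.x).
import numpy as np
from gensim.models import FastText

corpus = [
    ["social", "media", "influencers", "promote", "fitness", "products"],
    ["fitness", "apps", "track", "training", "and", "nutrition"],
    ["translators", "adapt", "marketing", "content", "for", "new", "markets"],
]
model = FastText(corpus, vector_size=50, window=3, min_count=1, min_n=3, max_n=5)

# Compositionality: represent a phrase by averaging its word vectors.
phrase_vector = np.mean([model.wv[w] for w in ["fitness", "apps"]], axis=0)
print(phrase_vector[:5])

# Out-of-vocabulary handling: "fitfluencer" never appears in the corpus, but
# FastText assembles a vector for it from character n-grams ("fit", "itf", ...).
print(model.wv["fitfluencer"][:5])
```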

Limitations of word embeddings  

Polysemy: Standard word embeddings struggle with polysemy, the phenomenon of words having multiple distinct meanings. 

Traditional word embedding models like Word2Vec or GloVe assign a single vector to each word, regardless of its potential multiple meanings. This approach leads to meaning conflation, where the vector becomes an averaged representation of all possible senses of the word. As a result, the embedding may not accurately represent any single meaning, diluting the semantic precision of the representation. 

The issue is compounded by the context-insensitive nature of these embeddings. In natural language, the intended meaning of a polysemous word is often determined by its surrounding context. Standard word embeddings, however, don't account for this contextual information, leading to potential misinterpretations in downstream applications. 

Consider words like "bank," which could refer to a financial institution or the edge of a river, or "plant," which might mean vegetation or a factory. In these cases, the word embedding struggles to differentiate between these distinct meanings, potentially leading to errors in tasks such as machine translation, information retrieval, or sentiment analysis where understanding the correct sense of a word is crucial. 
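
You can see this for yourself with any static, pre-trained model. In the hedged sketch below, the GloVe model loaded through gensim’s downloader is an assumption about your setup; the point is simply that “bank” is always returned as one and the same vector, whatever the surrounding text says.

```python
# A static embedding gives "bank" a single vector that has to average the
# financial sense and the riverside sense.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")

bank = vectors["bank"]   # the same vector, whether the text is about money or rivers

print(vectors.similarity("bank", "money"))   # fairly high
print(vectors.similarity("bank", "river"))   # also non-trivial: one vector is
                                             # pulled towards both senses
```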

This limitation can have significant quantitative effects on the performance of NLP models. Research has shown that the accuracy of word embedding models often drops markedly for polysemous words compared to monosemous (single-meaning) words. This decrease in performance can cascade through various NLP tasks like machine translation or sentiment analysis, affecting the overall reliability and effectiveness of systems relying on these embeddings. This static nature is a prominent problem in traditional embeddings because they assign a fixed vector to each word, regardless of context. 

This limitation remains an active area of research in natural language processing, and it is one reason why glossaries, glossary functions and terminology management remain an area of expertise in machine translation for translation companies and translators.

Modern contextual models tackle this static behaviour with the transformer architecture, which builds its representations in two steps.

Positional Encodings:

Positional encodings are added to the token embeddings to incorporate the order of tokens in the sequence, allowing the model to understand the structure of the text. 
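
One common choice, assumed here for illustration, is the sinusoidal scheme from the original Transformer paper; learned positional embeddings are an equally valid alternative.

```python
# Sinusoidal positional encodings: a (seq_len, d_model) matrix that is added
# element-wise to the token embeddings to encode word order.
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    positions = np.arange(seq_len)[:, np.newaxis]        # (seq_len, 1)
    dims = np.arange(d_model)[np.newaxis, :]             # (1, d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    encoding = np.zeros((seq_len, d_model))
    encoding[:, 0::2] = np.sin(angles[:, 0::2])          # even dimensions: sine
    encoding[:, 1::2] = np.cos(angles[:, 1::2])          # odd dimensions: cosine
    return encoding

pe = sinusoidal_positional_encoding(seq_len=8, d_model=16)
print(pe.shape)   # (8, 16)
```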

Transformer Layers: 

The embedded tokens (plus positional encodings) pass through multiple transformer layers, where self-attention mechanisms enable the model to consider the entire context of a sequence, enhancing the embeddings' contextual relevance. 
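
The sketch below strips self-attention down to its bare essentials (a single head and no learned query/key/value projections) just to show how every token’s embedding is updated using all the other tokens in the sequence.

```python
# Bare-bones scaled dot-product self-attention, for illustration only.
import numpy as np

def self_attention(x: np.ndarray) -> np.ndarray:
    """x: (seq_len, d_model) token embeddings (with positional encodings added)."""
    d_model = x.shape[-1]
    scores = x @ x.T / np.sqrt(d_model)                  # (seq_len, seq_len)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # softmax over the sequence
    return weights @ x                                   # contextualized embeddings

tokens = np.random.randn(5, 16)       # 5 tokens, 16-dimensional embeddings
contextual = self_attention(tokens)
print(contextual.shape)               # (5, 16): same shape, now context-aware
```

In a real transformer, the queries, keys and values come from learned projections of the input, and several such heads run in parallel inside every layer.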

Advanced techniques to overcome those initial limitations: Retrofitting to enhance word embeddings with external knowledge  

Retrofitting is a sophisticated technique in the field of natural language processing that aims to refine pre-trained word embeddings by incorporating information from external knowledge sources. This method addresses some of the inherent limitations of standard word embeddings, particularly their difficulty in handling polysemy and lack of explicit semantic or relational information. 

At its core, retrofitting adjusts the vectors of pre-trained word embeddings to better align with semantic relationships defined in external lexical resources. These resources can include comprehensive linguistic databases like WordNet or FrameNet, or even domain-specific ontologies. The process begins with pre-trained word embeddings, such as those generated by popular algorithms like Word2Vec, GloVe, or FastText. These initial embeddings capture distributional semantics based on word co-occurrences in large text corpora. 

The retrofitting procedure then utilizes a semantic lexicon or knowledge base that defines relationships between words. This external resource provides structured information about word meanings and connections that may not be fully captured by distributional methods alone. The algorithm iteratively updates the word vectors, bringing semantically related words closer together in the vector space while maintaining similarity to their original embeddings. 

Mathematically, retrofitting typically involves minimizing a cost function that balances two primary objectives. The first is to keep the retrofitted vectors close to their original pre-trained values, preserving the valuable distributional information learned from large text corpora. The second is to ensure that words connected in the semantic resource have similar vector representations, thereby incorporating the structured knowledge into the embedding space. 
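
As a rough sketch of that iterative update, the code below loosely follows the widely cited retrofitting formulation of Faruqui et al. (2015). The weights, the number of iterations and the toy synonym lexicon are illustrative assumptions, not a tuned implementation.

```python
# Iterative retrofitting: nudge each word vector towards its neighbours in a
# semantic lexicon while staying close to its original embedding.
import numpy as np

def retrofit(embeddings: dict, lexicon: dict, iterations: int = 10) -> dict:
    retrofitted = {w: v.copy() for w, v in embeddings.items()}
    for _ in range(iterations):
        for word, neighbours in lexicon.items():
            neighbours = [n for n in neighbours if n in retrofitted]
            if word not in retrofitted or not neighbours:
                continue
            alpha = 1.0                    # weight of the original vector
            beta = 1.0 / len(neighbours)   # weight of each lexicon neighbour
            # New vector: weighted average of the original embedding and the
            # current vectors of the word's neighbours in the lexicon.
            numerator = alpha * embeddings[word] + beta * sum(
                retrofitted[n] for n in neighbours
            )
            retrofitted[word] = numerator / (alpha + beta * len(neighbours))
    return retrofitted

# Toy example: pull "cheap" and "inexpensive" closer together via a synonym list.
emb = {w: np.random.randn(50) for w in ["cheap", "inexpensive", "expensive"]}
new_emb = retrofit(emb, {"cheap": ["inexpensive"], "inexpensive": ["cheap"]})
```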

This approach offers several advantages over standard word embeddings. First, it improves semantic accuracy by capturing more nuanced word relationships that are explicitly defined in the knowledge resource. This can lead to better performance in various natural language processing tasks, especially those requiring fine-grained semantic understanding. 

Second, retrofitting facilitates domain adaptation. General-purpose embeddings can be tailored to specific domains by using domain-specific knowledge resources, making them more relevant and accurate for specialized applications. This is particularly useful in fields like medicine, law, or finance, where terminology and word usage can be highly specialized. 

Third, retrofitting can potentially improve representations for rare words. These words often have poor representations in standard embeddings due to limited occurrences in the training corpus. By leveraging external knowledge, retrofitting can enhance the quality of these representations, leading to better handling of uncommon terms. 

Lastly, retrofitting preserves the valuable distributional information learned from large text corpora while adding structured knowledge. This combination of data-driven and knowledge-based approaches results in embeddings that benefit from both statistical patterns in language use and curated semantic information. 

In conclusion, retrofitting represents a powerful technique for enhancing word embeddings, bridging the gap between purely distributional methods and structured knowledge resources. As natural language processing continues to advance, techniques like retrofitting play a crucial role in developing more sophisticated and semantically rich representations of language.

 

Retrofitting: Challenges and Future Directions 

Retrofitting word embeddings has emerged as a powerful technique for enhancing the semantic richness of distributional word representations. However, like any advanced method in natural language processing, it comes with its own set of challenges and limitations that researchers and practitioners must navigate. 

One of the primary concerns in retrofitting is the quality of the knowledge source used. The effectiveness of the retrofitting process is intrinsically tied to the comprehensiveness, accuracy, and relevance of the external knowledge base employed. If the knowledge source is incomplete, outdated, or contains errors, these shortcomings can propagate into the retrofitted embeddings. This dependency underscores the importance of carefully selecting and vetting knowledge sources, especially when working in specialized domains or multilingual contexts. 

Another consideration is the computational cost associated with retrofitting. While pre-trained embeddings are readily available and can be used off-the-shelf, retrofitting introduces an additional step in the embedding preparation pipeline. This process can be computationally intensive, particularly when dealing with large vocabularies or complex knowledge graphs. The increased computational requirements may pose challenges in resource-constrained environments or when rapid deployment is necessary. 

Despite the improvements offered by retrofitting, the resulting embeddings still retain a fundamental limitation of traditional word embeddings: their static nature. Retrofitted embeddings, like their non-retrofitted counterparts, assign a fixed vector to each word, regardless of context. This approach does not fully address the challenge of polysemy or context-dependent meaning. Words with multiple senses or usage patterns are still represented by a single vector, which may not capture the full spectrum of their semantic nuances. 

Nonetheless, retrofitted embeddings have demonstrated tangible improvements across various natural language processing tasks. In semantic similarity judgments, they often exhibit better correlation with human ratings, capturing nuanced relationships between words more accurately. Word sense disambiguation tasks benefit from the additional semantic information incorporated through retrofitting, allowing for more precise differentiation between multiple word senses. In named entity recognition, retrofitted embeddings can leverage external knowledge to better represent proper nouns and domain-specific terminology. Text classification tasks also show improvements, particularly when the classification relies on fine-grained semantic distinctions. 

Looking ahead, the field of retrofitting continues to evolve, with several promising research directions. One area of focus is the effective combination of multiple knowledge sources. Researchers are exploring ways to integrate information from diverse lexical resources, ontologies, and knowledge graphs to create more comprehensive and robust retrofitted embeddings. This approach aims to leverage the strengths of different knowledge sources while mitigating their individual limitations. 

Another exciting avenue is the development of dynamic retrofitting techniques. These methods seek to address the static nature of traditional embeddings by adapting the retrofitting process to context. The goal is to create embeddings that can flexibly represent words based on their usage in specific contexts, potentially resolving ambiguities and capturing subtle meaning variations more effectively. 

Furthermore, there is ongoing work to integrate retrofitting concepts with more advanced embedding models such as BERT or GPT. These contextualized embedding models have revolutionized many NLP tasks, and researchers are exploring ways to incorporate external knowledge into these architectures. This integration could potentially combine the strengths of deep contextual representations with the structured semantic information provided by retrofitting. 
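
As a point of comparison with the static embeddings discussed earlier, the hedged sketch below uses the Hugging Face transformers library to pull contextual vectors for “bank” out of a pre-trained BERT model. The model name and library calls are assumptions about your environment; the only point is that the two occurrences of “bank” no longer share a single vector.

```python
# Contextual embeddings: "bank" gets a different vector in each sentence.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence: str) -> torch.Tensor:
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]          # (seq_len, 768)
    bank_id = tokenizer.convert_tokens_to_ids("bank")
    position = (inputs["input_ids"][0] == bank_id).nonzero()[0].item()
    return hidden[position]

v_money = bank_vector("She deposited cash at the bank.")
v_river = bank_vector("They had a picnic on the river bank.")
print(torch.cosine_similarity(v_money, v_river, dim=0))   # clearly below 1.0
```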

Retrofitting represents a significant step towards bridging the gap between purely distributional methods of word representation and more structured approaches to capturing semantic meaning in natural language processing. While challenges remain, the ongoing research and development in this area promise to yield even more powerful and nuanced word representations, further advancing our ability to process and understand natural language in increasingly sophisticated ways.