5 min read

05/09/2024

Vector Databases: Powering the Next Generation of AI Translation

The field of artificial intelligence and machine learning is evolving even as I write this article. Vector databases have emerged as a powerful tool for storing and retrieving high-dimensional data. This week's article on the intersection of technology, AI, and language translation / localization services explores the concept of vector databases, their inner workings, and their application in our Deep Adaptive AI Translation, particularly in the context of automatic post-editing.


What is a Vector Database?

A vector database is a specialized database system designed to store, manage, and query high-dimensional vector data (vector embeddings) efficiently. Vector embeddings are numerical representations of data objects, such as text, images, or audio, in a high-dimensional space. Unlike traditional relational databases, which store structured data in tables, vector databases are optimized for handling numerical vectors: ordered, fixed-length lists of numbers that represent the features or characteristics of data points. They store these vectors alongside other data items and typically rely on Approximate Nearest Neighbor (ANN) algorithms for search.

The purpose of a vector database is to store mathematical representations of data in a high-dimensional space, where each dimension corresponds to a data feature, supporting complex data representation.
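
To make this concrete, here is a toy sketch in Python (not any particular product's API; the 4-dimensional vectors are invented for readability, whereas real embeddings typically have hundreds or thousands of dimensions):

```python
import numpy as np

# A toy "vector database": each record pairs a fixed-length vector
# (the embedding) with the original data item it represents.
records = [
    {"id": 1, "text": "The engine overheats quickly.", "vector": np.array([0.90, 0.10, 0.00, 0.20])},
    {"id": 2, "text": "El motor se calienta rápido.",  "vector": np.array([0.88, 0.12, 0.02, 0.19])},
    {"id": 3, "text": "Invoices are sent monthly.",    "vector": np.array([0.10, 0.90, 0.30, 0.00])},
]

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Similarity of two embeddings: ~1.0 = same direction, ~0.0 = unrelated."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Querying means comparing vectors, not matching strings: here the Spanish
# sentence is retrieved because its vector points in the closest direction.
query = np.array([0.85, 0.15, 0.05, 0.20])  # embedding of a new sentence
best = max(records, key=lambda r: cosine_similarity(query, r["vector"]))
print(best["text"])
```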

The Rise of Vector Data

As AI and machine learning applications become more prevalent across industries, they generate vast amounts of data in the form of vectors: mathematical representations of features or attributes in a multi-dimensional space, rather than word-based descriptions. These vectors, often containing hundreds or thousands of dimensions, pose significant challenges for traditional text-based or number-based database systems.

"Traditional databases were simply not designed to handle the complexity and scale of vector data," explains Jose Miguel Herrera, our Head of Machine Learning. "Vector databases fill a critical gap in our data infrastructure, enabling us to work with AI-generated data more efficiently and effectively, and this has an impact on the way we search and we offer information. And also, of course, the way we can produce AI translations."

Key Features of Vector Databases Driving Adoption

Vector databases bring several key features to the table that set them apart from traditional database systems. These are the main features to understand, and they are radically different from the "usual" databases we have known until recently.

High-dimensional Data Storage

At the core of vector databases is their ability to efficiently store and retrieve vectors with hundreds or thousands of dimensions. This capability is crucial for applications like image recognition, natural language processing, and recommendation systems, where data points are often represented as high-dimensional vectors.

"Imagine trying to store and search through millions of images, each represented by a vector with 2,048 dimensions," says Maria Ángeles Garcia, our Head of Machine Translation. "Traditional databases would crumble under the weight of such data, but vector databases handle it elegantly and with ease. So now we can search for words and related synonyms and truly understand why a term fits or needs to fit in a particular sentence in a particular context. What's better, we can even force machine translation systems to translate in a certain way and knowledge systems to retrieve relevant information. The combination of both would create a multilingual Virtual AI Assistant - that is our ECOChat."

Similarity Search

Perhaps the most powerful feature of vector databases is their optimization for similarity searches. Unlike exact-match queries common in traditional databases, vector databases excel at finding the most similar vectors to a given query vector.

This capability opens up a world of possibilities for applications such as content recommendation, fraud detection, and semantic search. "With vector databases, we can find 'similar' items in ways that were previously impossible or prohibitively expensive," Maria explains. "In this way, we can accept words like car, vehicle, or automobile in context as good translations, and not penalize them, because they fit better in context. With typical machine translation evaluations, the use of a synonym would be penalized, even if it actually improved the translation. This also helps us interpret queries for our AI Virtual Assistant, so if someone asks 'who is the boss at Pangeanic', the system will provide information about the CEO. This is absolutely wonderful when looking for information among hundreds of pages, legislation, or documents."
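
A minimal sketch of the synonym example Maria describes, with invented 3-dimensional toy vectors standing in for real model-generated embeddings:

```python
import numpy as np

# Toy embeddings: synonyms end up close together in vector space.
vocab = {
    "car":        np.array([0.95, 0.05, 0.10]),
    "vehicle":    np.array([0.90, 0.10, 0.12]),
    "automobile": np.array([0.93, 0.07, 0.09]),
    "banana":     np.array([0.05, 0.92, 0.30]),
}

def top_k(query_word: str, k: int = 2):
    """Rank the other words by cosine similarity to the query word."""
    q = vocab[query_word]
    scores = {
        w: float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v)))
        for w, v in vocab.items() if w != query_word
    }
    return sorted(scores.items(), key=lambda kv: -kv[1])[:k]

# "vehicle" and "automobile" score far above "banana", so a vector-based
# metric can accept them as valid alternatives instead of penalizing them.
print(top_k("car"))
```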

Scalability

Organizations keep accumulating ever-larger datasets of all types, so scalability becomes a critical concern. Vector databases are designed from the ground up to handle large amounts of vector data and to keep queries fast even as data volumes grow.

"We're seeing companies with billions of vectors in their databases, and the queries are still blazing fast," notes Jose Miguel. "This scalability is what makes vector databases a game-changer for large-scale AI applications."

Advanced Indexing Techniques

To achieve their impressive performance, vector databases employ specialized indexing techniques that dramatically speed up similarity searches. These methods, such as locality-sensitive hashing and hierarchical navigable small world graphs, allow for approximate nearest neighbor searches that are orders of magnitude faster than brute-force approaches.

"The indexing algorithms used in vector databases are a fascinating area of ongoing research," Jose Miguel says. "They're constantly evolving, pushing the boundaries of what's possible in terms of search speed and accuracy."

How Vector Databases Work

Vector databases employ various algorithms and data structures to efficiently store and query high-dimensional data. Some common techniques include:

  1. Approximate Nearest Neighbor (ANN) search: Instead of performing exact searches, which can be computationally expensive, ANN algorithms find approximate matches quickly.
    A practical application for the localization industry: In a traditional translation memory system, you often need to find the most similar existing translations to a new sentence. The more the translation memory grows, the slower the retrieval process. ANN search helps you do this quickly, even if it's not always 100% accurate.
    Example: Imagine you have a database of hundreds of thousands or millions of translated sentences. When a new sentence comes in, instead of comparing it to every single sentence in the database (which would take a long time), ANN search quickly finds the most similar sentences. This might occasionally miss the absolute best match, but it's much faster and usually good enough.
    In practice: Your translation software might already be using ANN (with some tweaks) to suggest translations from your translation memory almost instantly, even with very large databases; see the combined sketch after this list.
  2. Indexing structures: Techniques like LSH (Locality-Sensitive Hashing) or tree-based methods (e.g., KD-trees) are used to organize vectors for faster retrieval. 
    These are like advanced filing systems for your vector data, making it easier to find what you need quickly.
    A practical application for the localization industry: Let's say you're using a neural machine translation system that represents words and phrases as vectors. Indexing structures organize these vectors efficiently. It's like having a well-organized library where you can quickly find books (or in this case, translations) on related topics.
    In practice: When your translation system is looking for similar phrases or sentences, it uses these indexing structures to narrow down the search quickly, rather than going through the entire database.
  3. Dimensionality reduction: Methods like PCA (Principal Component Analysis) or t-SNE can be used to reduce the dimensionality of vectors while preserving important information.
    A practical application for the localization industry: This is about simplifying complex data while keeping the most important information.
    Example: A neural network might represent each word in hundreds of dimensions to capture various aspects of meaning. But for some tasks, you might not need all that complexity. Dimensionality reduction is like creating a summary that captures the key points.
    In practice: You might use this when visualizing how different languages or dialects relate to each other, or when trying to identify clusters of similar content in your translation projects.
  4. Clustering: Vectors are often grouped into clusters to speed up searches and reduce the search space.
    A practical application for the localization industry: This involves grouping similar items together, which can make searches more efficient. 
    Example: In a large localization project, you might have clusters of technical terms, marketing phrases, and user interface elements. When looking for a translation, you can first identify which cluster it's most likely to belong to, then search within that cluster.
    In practice: Clustering can help in organizing your translation memories more efficiently. It can also be used in terminology management, grouping related terms together for more consistent translations. A sketch combining dimensionality reduction, clustering, and approximate search follows this list.
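
Below is a compact sketch combining points 1, 3, and 4 above, using scikit-learn with random vectors standing in for real sentence embeddings (an illustration of the mechanics, not a production recipe):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(20_000, 512))  # stand-in for TM sentence vectors

# Point 3 -- dimensionality reduction: keep 64 principal components.
pca = PCA(n_components=64)
reduced = pca.fit_transform(embeddings)

# Point 4 -- clustering: group the reduced vectors into 100 clusters.
kmeans = KMeans(n_clusters=100, n_init=4, random_state=0).fit(reduced)

def approximate_search(query: np.ndarray, k: int = 5) -> np.ndarray:
    """Point 1 -- ANN-style search: scan only the query's nearest cluster."""
    q = pca.transform(query.reshape(1, -1))
    members = np.where(kmeans.labels_ == kmeans.predict(q)[0])[0]
    dists = np.linalg.norm(reduced[members] - q, axis=1)
    return members[np.argsort(dists)[:k]]  # ids of the k closest sentences

print(approximate_search(embeddings[123]))  # item 123 should rank first
```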

These techniques work together in modern AI Translation workflows:

  1. When a new sentence comes in for translation, the system first uses indexing structures to efficiently search through the vast space of existing translations.
  2. It then applies ANN search to quickly find the most similar existing translations, without needing to compare against every single entry.
  3. Behind the scenes, dimensionality reduction might be used to simplify the representation of sentences or terms, making the search process even faster.
  4. In specific scenarios, the system might also use clustering to first identify which broad category the sentence belongs to (e.g., technical documentation, marketing material), and then focus its search within that cluster.

By combining these techniques, Pangeanic's Deep Adaptive AI Translation system can provide fast, accurate suggestions and translations, even when working with enormous databases of past translations and language data. This leads to faster turnaround times, more consistent translations, and significant cost savings in the administration and management of large-scale localization projects.
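
Readers who want to experiment can reproduce this kind of combined workflow with open-source tools. The sketch below uses the FAISS library's IVF index, which clusters vectors with k-means at training time and scans only a few clusters per query; it is an illustrative recipe under those assumptions, not a description of Pangeanic's internal stack:

```python
import numpy as np
import faiss  # pip install faiss-cpu

d = 768  # embedding dimensionality
rng = np.random.default_rng(1)
tm_vectors = rng.random((100_000, d), dtype=np.float32)  # stand-in TM embeddings

# IVF index: k-means partitions the vectors into 256 cells at training
# time (clustering); each query scans only a few cells (approximate search).
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFFlat(quantizer, d, 256)
index.train(tm_vectors)
index.add(tm_vectors)
index.nprobe = 8  # cells to scan per query; higher = slower but more accurate

query = tm_vectors[:1]                   # embedding of an incoming sentence
distances, ids = index.search(query, 5)  # five closest TM entries
print(ids)                               # candidate fuzzy matches to reuse
```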


Conclusion

Vector databases are revolutionizing the way we store and retrieve high-dimensional data, offering powerful capabilities for AI applications. In the context of Deep Adaptive AI Translation, they provide a robust foundation for improving translation quality, consistency, and style adaptation through automatic post-editing.