Embeddings

Low-dimensional representations of high-dimensional data

What are Embeddings?

Embeddings are a fundamental concept in the field of machine learning and natural language processing (NLP). At their core, embeddings are dense, low-dimensional representations of high-dimensional data, such as text or images. Unlike the original data, which might be sparse and difficult for algorithms to process efficiently, embeddings capture the essence or meaning of the data in a form that's easier for models to work with.

Embeddings are often used to transform discrete objects (like words, sentences, or entire documents) into continuous vectors. This transformation allows computational models to understand and process complex relationships between objects, such as semantic similarity or contextual relevance. By mapping data to a continuous vector space, embeddings enable a wide range of machine learning applications, from recommender systems and search engines to sophisticated NLP tasks like sentiment analysis and machine translation.
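To make the idea of semantic similarity in a vector space concrete, here is a minimal sketch that compares vectors with cosine similarity. The three-dimensional vectors are made up for illustration and do not come from any real model, and the helper name cosine_similarity is hypothetical rather than part of Vantage.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings, invented for this example only.
toy_embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "kitten": [0.85, 0.75, 0.2],
    "car": [0.1, 0.2, 0.95],
}

sim_cat_kitten = cosine_similarity(toy_embeddings["cat"], toy_embeddings["kitten"])
sim_cat_car = cosine_similarity(toy_embeddings["cat"], toy_embeddings["car"])

print(f"cat vs kitten: {sim_cat_kitten:.3f}")
print(f"cat vs car:    {sim_cat_car:.3f}")
```

With these toy values, the related pair scores higher than the unrelated pair. Embeddings from a real model typically have hundreds or thousands of dimensions, but the comparison works the same way.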

Vantage & Embeddings

Vantage is designed to efficiently store and retrieve embeddings. It is compatible with embeddings from leading AI providers, including OpenAI and HuggingFace. This compatibility allows users to leverage state-of-the-art embedding models for a variety of applications directly within Vantage. For users with more specific use cases, Vantage also supports uploading your own embeddings.

Users can decide which embedding creation option to use during the collection creation process. Below are examples of these options and how you can use them on your own.

Vantage-Managed Embeddings

1 | OpenAI Embeddings

Vantage supports OpenAI embeddings, enabling users to use the power of OpenAI's cutting-edge language models.

Collections using OpenAI Embeddings

To create a Vantage collection that uses OpenAI embeddings, users can create an instance of Vantage's OpenAICollection class and provide the required fields. Besides the collection_id and embeddings_dimension parameters, users are required to provide the llm parameter, which specifies the OpenAI model to use. To authenticate with OpenAI, users can choose between two options: providing an llm_secret, which is the OpenAI secret key, or providing an external_account_id, which is the ID of an already created external API key.

Example code block:

# Option 1: authenticate with the ID of an already created external API key.
# Assumes vantage_client is an already initialized Vantage client.
openai_collection = OpenAICollection(
    collection_id="my-openai-collection",
    embeddings_dimension=1536,
    llm="text-embedding-ada-002",
    external_account_id="YOUR_EXTERNAL_KEY_ID",
)

created_collection = vantage_client.create_collection(
    collection=openai_collection,
)

# Option 2: authenticate by passing the OpenAI secret key directly.
openai_collection = OpenAICollection(
    collection_id="my-openai-collection",
    embeddings_dimension=1536,
    llm="text-embedding-ada-002",
    llm_secret="YOUR_OPENAI_SECRET_KEY",
)

created_collection = vantage_client.create_collection(
    collection=openai_collection,
)

2 | HuggingFace Embeddings

Additionally, Vantage works with HuggingFace's wide range of pre-trained models, giving users the possibility to discover the ideal embeddings for their requirements.

Collections using HuggingFace Embeddings

To create a Vantage collection that uses HuggingFace embeddings, users can create an instance of Vantage's HuggingFaceCollection class and provide the required fields. Besides the collection_id and embeddings_dimension parameters, users are required to provide the external_url parameter, which is the URL of the deployed HuggingFace model endpoint. To authenticate with HuggingFace, users can choose between two options: providing an llm_secret, which is the HuggingFace secret key, or providing an external_account_id, which is the ID of an already created external API key.

🚧

external_url parameter

To use a HuggingFace model, it must first be deployed. This can be done through HuggingFace Inference Endpoints, which let you deploy your model in a few clicks. Once the model is deployed, copy the Endpoint URL and use it as the value of the external_url parameter. The llm_secret should be the secret key of the account from which your HuggingFace model was deployed.

Example code block:

# Option 1: authenticate with the ID of an already created external API key.
# Assumes vantage_client is an already initialized Vantage client.
hf_collection = HuggingFaceCollection(
    collection_id="my-hf-collection",
    embeddings_dimension=123,
    external_url="HF_ENDPOINT_URL",
    external_account_id="YOUR_EXTERNAL_KEY_ID",
)

created_collection = vantage_client.create_collection(
    collection=hf_collection,
)

# Option 2: authenticate by passing the HuggingFace secret key directly.
hf_collection = HuggingFaceCollection(
    collection_id="my-hf-collection",
    embeddings_dimension=123,
    external_url="HF_ENDPOINT_URL",
    llm_secret="YOUR_HF_SECRET_KEY",
)

created_collection = vantage_client.create_collection(
    collection=hf_collection,
)

User-Provided Embeddings

3 | Custom Embeddings

Understanding the unique needs of various projects, Vantage also provides the capability to upload your own custom embeddings. This feature allows users to fully customize their embeddings, optimizing them for specific data types, domains, or performance requirements.

Note: Each embedding vector must be a unit vector (L2 norm equal to 1) for Vantage collection operations to succeed.
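The unit-vector requirement in the note above can usually be met by L2-normalizing each embedding before upload. The sketch below shows one way to do that in plain Python. The helper names l2_normalize and is_unit_vector and the tolerance value are assumptions made for illustration; they are not part of the Vantage SDK, and the exact check Vantage applies is not specified here.

```python
import math


def l2_normalize(vector):
    """Scale a vector so that its L2 norm (Euclidean length) is 1."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def is_unit_vector(vector, tol=1e-6):
    """Check that the L2 norm is 1 within an assumed tolerance."""
    norm = math.sqrt(sum(x * x for x in vector))
    return math.isclose(norm, 1.0, abs_tol=tol)


# Hypothetical raw embedding, invented for this example only.
raw_embedding = [3.0, 4.0]
unit_embedding = l2_normalize(raw_embedding)

print(is_unit_vector(raw_embedding))
print(is_unit_vector(unit_embedding))
```

Normalizing changes only the length of the vector, not its direction, so cosine similarity between embeddings is unaffected.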

Collections using custom user-provided embeddings

In this scenario, users can utilize Vantage's UserProvidedEmbeddingsCollection class and create an instance of it by providing the required fields, which in this case are only collection_id and embeddings_dimension.

Example code block:

upe_collection = UserProvidedEmbeddingsCollection(
    collection_id="my-upe-collection",
    embeddings_dimension=123,
)

created_collection = vantage_client.create_collection(
    collection=upe_collection,
)

📘

Custom Embeddings - Full Tutorial

For a complete guide on using custom embeddings, check out our basic embedding search tutorial. It covers the entire process, from setting up your environment and preparing your data to indexing and querying your collection.