Documents

Vantage Ingestion Format

The documents (or records) you ingest or upload into your collections must conform to specific field and format requirements, which we call the Vantage Ingestion Format (VIF). Use this schema and its conventions for both JSONL and Parquet files, whether you upload via the Console UI, an upload file, or the upload documents API. All field names are lowercase, except where noted.

Vantage Document


Required Fields

id

This is your ID for the document. It is returned to you in search results.

  • Length: 1 to 256 characters

{
  "id": "http://mydomain.com/products/1234567"
}
{
  "id": "product-id-12345"
}
{
  "id": "9938173"
}

text

This field is the text that will be embedded using the model you provide. It should be a UTF-8 string, from zero to 1 GB in size.

embeddings

This is an array of 32-bit floating point numbers (4 bytes each). The array length should match the Dimension Size of the collection you are ingesting into. For instance, if you're using OpenAI's text-embedding-ada-002 model, which has a dimension size of 1536, your embeddings array should contain 1536 values.
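As a minimal sketch of the length and precision requirements above, the following Python helper checks an embeddings list against a collection's dimension size and rounds each value to a 32-bit float. The helper name `to_float32_embeddings` and the tiny `DIMENSION` value are hypothetical, chosen only for illustration; this is a client-side check, not part of any Vantage API.

```python
import json
from array import array


def to_float32_embeddings(values, dimension):
    """Check the length and round each value to 32-bit float precision.

    Hypothetical client-side helper; not part of any Vantage library.
    """
    if len(values) != dimension:
        raise ValueError(f"expected {dimension} values, got {len(values)}")
    # array("f") stores C floats, which are 4 bytes each on common platforms.
    return array("f", values).tolist()


# Hypothetical dimension for illustration; a collection built on
# text-embedding-ada-002 would use 1536 instead.
DIMENSION = 4

record = {
    "id": "document-100",
    "embeddings": to_float32_embeddings([0.25, -0.5, 1.0, 0.0], DIMENSION),
}
line = json.dumps(record)  # one JSONL line
```

Running the check before upload catches a dimension mismatch locally rather than at ingestion time.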

👍

Vantage Managed Embeddings (VME) or User Provided Embeddings (UPE)

During ingestion, you must include either text or embeddings, depending on whether Vantage is managing embeddings for you or you are providing them.

{
  "id" : "123",
  "text" : "The quick brown fox jumps over the lazy dog"
}
{
  "id" : "document-100",
  "embeddings" : [-0.010061902925372124, -0.017514921724796295, .... ]
}
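The two record shapes above can be serialized to JSONL (one JSON object per line) with the standard library. This is a sketch under stated assumptions: the function name `to_jsonl` is hypothetical, the check only mirrors the rules described on this page, and the embeddings values are shortened placeholders rather than a real vector.

```python
import json


def to_jsonl(records):
    """Serialize records to JSONL after a basic required-field check.

    Hypothetical helper that mirrors the rules on this page: every record
    needs an id, and every record except a delete needs text or embeddings.
    """
    lines = []
    for record in records:
        if "id" not in record:
            raise ValueError("record is missing the required 'id' field")
        is_delete = record.get("operation") == "delete"
        if not is_delete and "text" not in record and "embeddings" not in record:
            raise ValueError(f"record {record['id']!r} needs text or embeddings")
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines) + "\n"


records = [
    {"id": "123", "text": "The quick brown fox jumps over the lazy dog"},
    # Placeholder vector; a real one must match the collection's dimension size.
    {"id": "document-100", "embeddings": [-0.0100, -0.0175]},
]
payload = to_jsonl(records)
```

The resulting string can be written to a `.jsonl` file with UTF-8 encoding before upload.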

🚧

File with both text and embeddings

If your file includes both, the embeddings field takes precedence and we will use it instead of re-embedding the text. That means that even if you are using Vantage Managed Embeddings, where the text would normally be processed, the embeddings are used instead. This may be useful if you have already created embeddings for the data you are ingesting into the Vantage Platform.
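Because embeddings silently take precedence, it can help to list the records that carry both fields before uploading. The sketch below is a hypothetical client-side check (the function name `ids_with_text_and_embeddings` is an assumption for illustration); it does not reproduce how the platform itself resolves the two fields.

```python
def ids_with_text_and_embeddings(records):
    """Return the ids of records whose text will not be re-embedded.

    Hypothetical helper: a record with both fields is ingested using its
    embeddings, per the precedence rule described above.
    """
    return [r["id"] for r in records if "text" in r and "embeddings" in r]


records = [
    {"id": "a", "text": "only text"},
    {"id": "b", "embeddings": [0.1, 0.2]},
    {"id": "c", "text": "both fields", "embeddings": [0.3, 0.4]},
]
both = ids_with_text_and_embeddings(records)  # ["c"]
```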


Optional Fields

operation

This field specifies the action to be performed on the document. The available options are:

  • delete: Deletes the document.
  • update: Updates the values of the document.
  • add: Adds a new document. If the id of the document already exists, add acts as an update operation.

By default, the update option is used.

📘

delete operation requirements

If the operation is set to delete, the only other required field is id.
When deleting, delete should be the only operation used across all documents in the same ingestion file.
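A delete file can be sketched as a JSONL file in which every record carries only an id and the delete operation. The helper names below are hypothetical, and the mixing check assumes that a record without an operation field falls back to update, as stated above.

```python
import json


def build_delete_jsonl(ids):
    """Build a JSONL payload that deletes the given document ids."""
    return "".join(
        json.dumps({"id": doc_id, "operation": "delete"}) + "\n" for doc_id in ids
    )


def mixes_delete_with_other_operations(records):
    """Return True if a batch combines delete with any other operation.

    Records without an explicit operation are counted as update, the
    default described on this page.
    """
    operations = {r.get("operation", "update") for r in records}
    return "delete" in operations and len(operations) > 1


payload = build_delete_jsonl(["product-id-12345", "9938173"])
```

Running `mixes_delete_with_other_operations` on a parsed file before upload flags batches that should be split into a delete file and a separate add or update file.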

meta_ fields

Other fields, prefixed with meta_, can be provided to support querying and filtering. A document can have any number of these fields.

meta_<fieldname> is a field that should be indexed for search query filtering. The <fieldname> is case-sensitive: meta_FieldName is different from meta_fieldname.

The <fieldname> part of the field has specific naming restrictions:

  • Characters: May contain only [a-zA-Z0-9-_] characters
  • Minimum length: 3 characters
  • Maximum length: 255 characters

The values of these fields can only be these types:

  • Scalar Single Values: meta_<fieldname> : int, string, float
  • Array or List Values: meta_<fieldname> : [int], [string], [float]
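The naming and type rules above can be checked locally before upload. The following is a sketch only: `is_valid_meta_field` is a hypothetical name, rejecting booleans is an assumption (Python treats `bool` as a subclass of `int`, and this page does not list booleans as a supported type), and the sketch does not verify that every element of a list has the same type.

```python
import re

# <fieldname> rules from this page: [a-zA-Z0-9-_], 3 to 255 characters.
FIELDNAME_RE = re.compile(r"[a-zA-Z0-9_-]{3,255}")
SCALAR_TYPES = (int, float, str)


def _is_scalar(value):
    # Assumption: booleans are rejected, since only int, string, and float are listed.
    return isinstance(value, SCALAR_TYPES) and not isinstance(value, bool)


def is_valid_meta_field(key, value):
    """Check a meta_<fieldname> key and its value against the listed rules."""
    if not key.startswith("meta_"):
        return False
    fieldname = key[len("meta_"):]
    if not FIELDNAME_RE.fullmatch(fieldname):
        return False
    if isinstance(value, list):
        return all(_is_scalar(v) for v in value)
    return _is_scalar(value)


document = {
    "id": "123",
    "text": "The quick brown fox jumps over the lazy dog",
    "meta_color": "brown",
    "meta_tags": ["fox", "dog"],
    "meta_price": 19.99,
}
invalid = [
    k for k, v in document.items()
    if k.startswith("meta_") and not is_valid_meta_field(k, v)
]
```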

📘

The meta_ prefix is only used in the ingestion format. During querying, the meta_ prefix is dropped.

meta_ordered_ fields

In addition to meta_ fields, there are meta_ordered_ fields, which adhere to the same guidelines as meta_ fields but serve an additional purpose: sorting search results.

meta_ordered_<fieldname> is a field that should be indexed for sorting search query results. The <fieldname> is case-sensitive: meta_ordered_FieldName is different from meta_ordered_fieldname.

The <fieldname> part of the field has specific naming restrictions:

  • Characters: May contain only [a-zA-Z0-9-_] characters

These fields are specifically designed for ordering results based on their values. For detailed instructions on using meta_ordered_ fields, please refer to the Search Options page.

📘

The meta_ordered_ prefix is only used in the ingestion format. During querying, the meta_ordered_ prefix is dropped.
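The prefix rule for both field families can be illustrated with a small mapping from ingestion names to the names used at query time. The function name `query_field_name` is hypothetical, and returning other field names unchanged is an assumption made for this sketch.

```python
def query_field_name(ingestion_name):
    """Map an ingestion field name to the name used when querying.

    Hypothetical helper that only strips the documented prefixes.
    """
    # Check the longer prefix first, because "meta_ordered_" also starts with "meta_".
    for prefix in ("meta_ordered_", "meta_"):
        if ingestion_name.startswith(prefix):
            return ingestion_name[len(prefix):]
    # Assumption: fields without a meta prefix keep their name.
    return ingestion_name


filter_name = query_field_name("meta_color")        # "color"
sort_name = query_field_name("meta_ordered_price")  # "price"
```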