Parquet Format
The Vantage Parquet format is used to bulk upload your data via the Console UI and via the direct upload links available in the REST API.
This page describes the schema and format of the documents for ingestion.
Required Fields
id
: Physical type is BYTE_ARRAY, logical type is String
text
: Physical type is BYTE_ARRAY, logical type is String
embeddings
: Physical type is a list of DOUBLE
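As a rough illustration of these types, the sketch below prepares rows for a small embeddings file and writes them with pyarrow. The helper names (`build_rows`, `write_embeddings_parquet`) and the sample values are assumptions made for this example and are not part of the Vantage tooling. Declaring the schema explicitly keeps the embeddings column a list of 64-bit floats, which Parquet stores as DOUBLE; values inferred from 32-bit arrays can end up as FLOAT instead. `pa.Table.from_pylist` needs a reasonably recent pyarrow release.

```python
def build_rows(ids, vectors):
    """Pair each id with its embedding, coercing values to plain Python types."""
    if len(ids) != len(vectors):
        raise ValueError("ids and vectors must have the same length")
    return [
        {"id": str(doc_id), "embeddings": [float(x) for x in vector]}
        for doc_id, vector in zip(ids, vectors)
    ]


def write_embeddings_parquet(rows, path):
    """Write rows to a Parquet file with an explicit id/embeddings schema."""
    # Imported here so build_rows stays usable without pyarrow installed.
    import pyarrow as pa
    import pyarrow.parquet as pq

    schema = pa.schema([
        pa.field("id", pa.string()),                     # BYTE_ARRAY / String
        pa.field("embeddings", pa.list_(pa.float64())),  # list of DOUBLE
    ])
    table = pa.Table.from_pylist(rows, schema=schema)
    pq.write_table(table, path)


# Sample values are made up for illustration.
rows = build_rows(["doc-1", "doc-2"], [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]])
```

Calling `write_embeddings_parquet(rows, "my_embeddings.parquet")` (a placeholder file name) should then produce a file whose `id` column is a BYTE_ARRAY string and whose embeddings elements are DOUBLE.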
Optional Fields
operation
: Specifies the action to be performed on the document. The available options are update, add, and delete.
meta_[...]
: Supports querying and filtering
meta_ordered_[...]
: Supports sorting
variants
: Describes variants of a document
Typically you'd have only one of text or embeddings.
For more details, please check the Vantage Documents page.
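The field rules above can be checked on plain Python records before they are written to Parquet. The following is a minimal sketch: `check_record`, `ALLOWED_OPERATIONS`, and `KNOWN_FIELDS` are hypothetical names, and both the "unexpected field" rule and the note about setting text and embeddings together are this example's reading of the page, not documented Vantage behavior.

```python
ALLOWED_OPERATIONS = {"update", "add", "delete"}
KNOWN_FIELDS = {"id", "text", "embeddings", "operation", "variants"}


def check_record(record):
    """Return a list of human-readable issues found in one document record."""
    issues = []
    if "id" not in record:
        issues.append("missing required field: id")
    if "text" in record and "embeddings" in record:
        # Assumption: flagged only as unusual, since the page says "typically".
        issues.append("both text and embeddings are set")
    operation = record.get("operation")
    if operation is not None and operation not in ALLOWED_OPERATIONS:
        issues.append(f"unknown operation: {operation}")
    for name in record:
        # The meta_ prefix also covers meta_ordered_ fields.
        if name not in KNOWN_FIELDS and not name.startswith("meta_"):
            issues.append(f"unexpected field: {name}")
    return issues


# Sample records are made up for illustration.
good = {"id": "1", "text": "hello", "meta_product_type": "shoes", "operation": "add"}
bad = {"text": "hello", "operation": "upsert", "colour": "red"}
print(check_record(good))
print(check_record(bad))
```

Returning a list of issues rather than raising on the first one makes it easier to report every problem in a batch before uploading.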
Check the Format of a Parquet File
Here's some Python to validate the format of a .parquet file. You can find the examples in our vantage-tutorials repository on GitHub.
import pyarrow.parquet as pq
# Read the Parquet file metadata
parquet_file = pq.ParquetFile('hello_world.parquet')
# Get the schema
schema = parquet_file.schema
# Print columns and their types
for field in schema:
    physical_type = field.physical_type
    logical_type = field.logical_type
    print(f"Column: {field.path}, Physical Type: {physical_type}, Logical Type: {logical_type}")
Column: id, Physical Type: BYTE_ARRAY, Logical Type: String
Column: text, Physical Type: BYTE_ARRAY, Logical Type: String
Column: meta_product_type, Physical Type: BYTE_ARRAY, Logical Type: String
The same check for a file that contains embeddings:
import pyarrow.parquet as pq
# Read the Parquet file metadata
parquet_file = pq.ParquetFile('hello_world_embeddings.parquet')
# Get the schema
schema = parquet_file.schema
# Print columns and their types
for field in schema:
    physical_type = field.physical_type
    logical_type = field.logical_type
    print(f"Column: {field.path}, Physical Type: {physical_type}, Logical Type: {logical_type}")
Column: id, Physical Type: BYTE_ARRAY, Logical Type: String
Column: embeddings.list.element, Physical Type: DOUBLE, Logical Type: None
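The printed listing can also be turned into an automated check. The sketch below is one possible approach, not an official Vantage validator: `EXPECTED`, `check_columns`, and `read_columns` are names invented here, and the rules simply encode the types listed under Required Fields. Requiring at least one of text or embeddings is an assumption of this example. The embeddings column is matched by prefix because the nested path segment names can differ between Parquet writers.

```python
# Expected (physical type, logical type) per column, taken from this page.
EXPECTED = {
    "id": ("BYTE_ARRAY", "String"),
    "text": ("BYTE_ARRAY", "String"),
}


def check_columns(columns):
    """Check (path, physical_type, logical_type) tuples; return a list of problems."""
    problems = []
    paths = {path for path, _, _ in columns}
    if "id" not in paths:
        problems.append("missing column: id")
    has_text = "text" in paths
    has_embeddings = any(p.startswith("embeddings.") for p in paths)
    if not (has_text or has_embeddings):
        # Assumption: a file should carry at least one of the two.
        problems.append("need text or embeddings")
    for path, physical, logical in columns:
        if path in EXPECTED and (physical, logical) != EXPECTED[path]:
            problems.append(f"{path}: got {physical}/{logical}")
        if path.startswith("embeddings.") and physical != "DOUBLE":
            problems.append(f"{path}: expected DOUBLE, got {physical}")
    return problems


def read_columns(parquet_path):
    """Collect (path, physical_type, logical_type) tuples from a Parquet file."""
    import pyarrow.parquet as pq  # imported lazily; only needed for real files

    schema = pq.ParquetFile(parquet_path).schema
    return [
        (schema.column(i).path, schema.column(i).physical_type,
         str(schema.column(i).logical_type))
        for i in range(len(schema))
    ]


# Tuples copied from the embeddings listing above.
columns = [
    ("id", "BYTE_ARRAY", "String"),
    ("embeddings.list.element", "DOUBLE", "None"),
]
print(check_columns(columns))
```

For a real file, `check_columns(read_columns('hello_world_embeddings.parquet'))` would run the same rules against the schema that pyarrow reports.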