Parquet Format

The Vantage Parquet format is used to bulk upload your data via the Console UI and via the direct upload links available in the REST API.

The contents of the file must adhere to the Vantage Ingestion Format. You will find specific examples here on how to create and validate a .parquet file that implements the format needed to ingest into the Vantage platform.

Required Fields

  • id: Physical type is BYTE_ARRAY, logical type is String
  • text: Physical type is BYTE_ARRAY, logical type is String
  • embeddings: Physical type is a list of DOUBLE

Typically a file contains only one of text or embeddings, alongside id.

Check Format of Parquet

The following Python prints the physical and logical type of each column in a .parquet file. The examples are also available in our vantage-tutorials repository on GitHub.

import pyarrow.parquet as pq

# Read the Parquet file metadata
parquet_file = pq.ParquetFile('hello_world.parquet')

# Get the schema
schema = parquet_file.schema

# Print columns and their types
for field in schema:
    physical_type = field.physical_type
    logical_type = field.logical_type
    print(f"Column: {field.path}, Physical Type: {physical_type}, Logical Type: {logical_type}")

For a file that contains text, this prints:

Column: id, Physical Type: BYTE_ARRAY, Logical Type: String
Column: text, Physical Type: BYTE_ARRAY, Logical Type: String
Column: meta_product_type, Physical Type: BYTE_ARRAY, Logical Type: String

The same check on a file that contains embeddings:

import pyarrow.parquet as pq

# Read the Parquet file metadata
parquet_file = pq.ParquetFile('hello_world_embeddings.parquet')

# Get the schema
schema = parquet_file.schema

# Print columns and their types
for field in schema:
    physical_type = field.physical_type
    logical_type = field.logical_type
    print(f"Column: {field.path}, Physical Type: {physical_type}, Logical Type: {logical_type}")

For a file that contains embeddings, this prints:

Column: id, Physical Type: BYTE_ARRAY, Logical Type: String
Column: embeddings.list.element, Physical Type: DOUBLE, Logical Type: None