Parquet Format

The Vantage Parquet format is used to bulk upload your data via the Console UI and via the direct upload links available in the REST API.


Vantage Ingestion Format

The schema and format of the documents for ingestion.

Required Fields

  • id: Physical type is BYTE_ARRAY, logical type is String
  • text: Physical type is BYTE_ARRAY, logical type is String
  • embeddings: Physical type is a list of DOUBLE


Typically a file contains only one of text or embeddings alongside id, not both.

For more details, see the Vantage Documents page.

Check the Format of a Parquet File

Here's some Python to validate the format of a .parquet file. You can find the examples in our vantage-tutorials repository on GitHub.

import pyarrow.parquet as pq

# Read the Parquet file metadata
parquet_file = pq.ParquetFile('hello_world.parquet')

# Get the schema
schema = parquet_file.schema

# Print columns and their types
for field in schema:
    physical_type = field.physical_type
    logical_type = field.logical_type
    print(f"Column: {field.path}, Physical Type: {physical_type}, Logical Type: {logical_type}")

Output:

Column: id, Physical Type: BYTE_ARRAY, Logical Type: String
Column: text, Physical Type: BYTE_ARRAY, Logical Type: String
Column: meta_product_type, Physical Type: BYTE_ARRAY, Logical Type: String

The same check on a file that contains embeddings instead of text:

import pyarrow.parquet as pq

# Read the Parquet file metadata
parquet_file = pq.ParquetFile('hello_world_embeddings.parquet')

# Get the schema
schema = parquet_file.schema

# Print columns and their types
for field in schema:
    physical_type = field.physical_type
    logical_type = field.logical_type
    print(f"Column: {field.path}, Physical Type: {physical_type}, Logical Type: {logical_type}")

Output:

Column: id, Physical Type: BYTE_ARRAY, Logical Type: String
Column: embeddings.list.element, Physical Type: DOUBLE, Logical Type: None