emkornfield commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1899171394
##
VariantShredding.md:
##
@@ -25,290 +25,320 @@
The Variant type is designed to store and process semi-structured data
efficiently, even with heterogeneous values.
Query engines encode each Variant value in a self-describing format, and store
it as a group containing `value` and `metadata` binary fields in Parquet.
Since data is often partially homogeneous, it can be beneficial to extract
certain fields into separate Parquet columns to further improve performance.
-We refer to this process as **shredding**.
-Each Parquet file remains fully self-describing, with no additional metadata
required to read or fully reconstruct the Variant data from the file.
-Combining shredding with a binary residual provides the flexibility to
represent complex, evolving data with an unbounded number of unique fields
while limiting the size of file schemas, and retaining the performance benefits
of a columnar format.
+This process is **shredding**.
-This document focuses on the shredding semantics, Parquet representation,
implications for readers and writers, as well as the Variant reconstruction.
-For now, it does not discuss which fields to shred, user-facing API changes,
or any engine-specific considerations like how to use shredded columns.
-The approach builds upon the [Variant Binary Encoding](VariantEncoding.md),
and leverages the existing Parquet specification.
+Shredding enables the use of Parquet's columnar representation for more
compact data encoding, column statistics for data skipping, and partial
projections.
-At a high level, we replace the `value` field of the Variant Parquet group
with one or more fields called `object`, `array`, `typed_value`, and
`variant_value`.
-These represent a fixed schema suitable for constructing the full Variant
value for each row.
+For example, the query `SELECT variant_get(event, '$.event_ts', 'timestamp')
FROM tbl` only needs to load field `event_ts`, and if that column is shredded,
it can be read by columnar projection without reading or deserializing the rest
of the `event` Variant.
+Similarly, for the query `SELECT * FROM tbl WHERE variant_get(event,
'$.event_type', 'string') = 'signup'`, the `event_type` shredded column
metadata can be used for skipping and to lazily load the rest of the Variant.
-Shredding allows a query engine to reap the full benefits of Parquet's
columnar representation, such as more compact data encoding, min/max statistics
for data skipping, and I/O and CPU savings from pruning unnecessary fields not
accessed by a query (including the non-shredded Variant binary data).
-Without shredding, any query that accesses a Variant column must fetch all
bytes of the full binary buffer.
-With shredding, we can get performance nearly equivalent to that of a relational
(scalar) data model.
+## Variant Metadata
-For example, `select variant_get(variant_col, '$.field1.inner_field2',
'string') from tbl` only needs to access `inner_field2`, and the file scan
could avoid fetching the rest of the Variant value if this field was shredded
into a separate column in the Parquet schema.
-Similarly, for the query `select * from tbl where variant_get(variant_col,
'$.id', 'integer') = 123`, the scan could first decode the shredded `id`
column, and only fetch/decode the full Variant value for rows that pass the
filter.
+Variant metadata is stored in the top-level Variant group in a binary
`metadata` column regardless of whether the Variant value is shredded.
-# Parquet Example
+All `value` columns within the Variant must use the same `metadata`.
+All field names of a Variant, whether shredded or not, must be present in the
metadata.
-Consider the following Parquet schema together with how Variant values might
be mapped to it.
-Notice that we represent each shredded field in `object` as a group of two
fields, `typed_value` and `variant_value`.
-We extract all homogeneous data items of a certain path into `typed_value`, and
set aside incompatible data items in `variant_value`.
-Intuitively, incompatibilities within the same path may occur because we store
the shredding schema per Parquet file, and each file can contain several row
groups.
-Selecting a type for each field that is acceptable for all rows would be
impractical because it would require buffering the contents of an entire file
before writing.
+## Value Shredding
-Typically, the expectation is that `variant_value` exists at every level as an
option, along with one of `object`, `array` or `typed_value`.
-If the actual Variant value contains a type that does not match the provided
schema, it is stored in `variant_value`.
-A `variant_value` may also be populated if an object can be partially
represented: any fields that are present in the schema must be written to those
fields, and any missing fields are written to `variant_value`.
-