emkornfield commented on code in PR #461: URL: https://github.com/apache/parquet-format/pull/461#discussion_r1881054123
##########
VariantShredding.md:
##########
@@ -25,290 +25,316 @@
 The Variant type is designed to store and process semi-structured data efficiently, even with heterogeneous values.
 Query engines encode each Variant value in a self-describing format, and store it as a group containing `value` and `metadata` binary fields in Parquet.
 Since data is often partially homogenous, it can be beneficial to extract certain fields into separate Parquet columns to further improve performance.
-We refer to this process as **shredding**.
-Each Parquet file remains fully self-describing, with no additional metadata required to read or fully reconstruct the Variant data from the file.
-Combining shredding with a binary residual provides the flexibility to represent complex, evolving data with an unbounded number of unique fields while limiting the size of file schemas, and retaining the performance benefits of a columnar format.
+This process is **shredding**.
-This document focuses on the shredding semantics, Parquet representation, implications for readers and writers, as well as the Variant reconstruction.
-For now, it does not discuss which fields to shred, user-facing API changes, or any engine-specific considerations like how to use shredded columns.
-The approach builds upon the [Variant Binary Encoding](VariantEncoding.md), and leverages the existing Parquet specification.
+Shredding enables the use of Parquet's columnar representation for more compact data encoding, column statistics for data skipping, and partial projections.

Review Comment:
My intent was to define what a "partial projection" means for readers who are not fluent with Variant types. I proposed an alternative formulation of the sentence below.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscr...@parquet.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
For additional commands, e-mail: issues-h...@parquet.apache.org