Re: [PR] GH-465: Clarify backward-compatibility rules on LIST type [parquet-format]
rdblue commented on code in PR #466:
URL: https://github.com/apache/parquet-format/pull/466#discussion_r1859233788

## LogicalTypes.md: ##
@@ -609,9 +609,20 @@ that is neither contained by a `LIST`- or `MAP`-annotated group nor annotated
 by `LIST` or `MAP` should be interpreted as a required list of required
 elements where the element type is the type of the field.
-Implementations should use either `LIST` and `MAP` annotations _or_ unannotated
-repeated fields, but not both. When using the annotations, no unannotated
-repeated types are allowed.
+```
+// List<Integer> (non-null list, non-null elements)
+repeated int32 num;
+
+// List<Tuple<Integer, String>> (non-null list, non-null elements)
+repeated group my_list {
+  required int32 num;
+  optional binary str (STRING);
+}
+```
+
+For all fields in the schema, implementations should use either `LIST` and

Review Comment:
-0 on this change. I don't think this is more clear and I would prefer not to have the churn.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

-
To unsubscribe, e-mail: issues-unsubscr...@parquet.apache.org
For additional commands, e-mail: issues-h...@parquet.apache.org
Re: [PR] GH-465: Clarify backward-compatibility rules on LIST type [parquet-format]
rdblue commented on code in PR #466:
URL: https://github.com/apache/parquet-format/pull/466#discussion_r1859240321

## LogicalTypes.md: ##
@@ -684,44 +702,67 @@ optional group my_list (LIST) {
 }
 ```
-Some existing data does not include the inner element layer. For
-backward-compatibility, the type of elements in `LIST`-annotated structures
-should always be determined by the following rules:
+# 2-level structure
+
+Some existing data does not include the inner element layer, resulting in a
+`LIST` that annotates a 2-level structure. Unlike the 3-level structure, the
+repetition of a 2-level structure can be `optional`, `required`, or `repeated`.
+When it is `repeated`, the `LIST`-annotated 2-level structure can only serve as
+an element within another `LIST`-annotated 2-level structure.
+
+```
+ group (LIST) {
+  repeated ;
+}

Review Comment:
Again, I think that calling attention to the degenerate cases and documenting them is only going to cause more confusion. The purpose of this originally was to simply document how to interpret data that doesn't match expectations. Now this introduces how a 2-level list looks, which I think increases the possibility that people will misread this and write them.
Re: [PR] GH-465: Clarify backward-compatibility rules on LIST type [parquet-format]
rdblue commented on code in PR #466:
URL: https://github.com/apache/parquet-format/pull/466#discussion_r1859237857

## LogicalTypes.md: ##
@@ -609,9 +609,20 @@ that is neither contained by a `LIST`- or `MAP`-annotated group nor annotated
 by `LIST` or `MAP` should be interpreted as a required list of required
 elements where the element type is the type of the field.
-Implementations should use either `LIST` and `MAP` annotations _or_ unannotated
-repeated fields, but not both. When using the annotations, no unannotated
-repeated types are allowed.
+```
+// List<Integer> (non-null list, non-null elements)
+repeated int32 num;
+
+// List<Tuple<Integer, String>> (non-null list, non-null elements)
+repeated group my_list {
+  required int32 num;
+  optional binary str (STRING);
+}

Review Comment:
I think this example is counter-productive. We don't want anyone using un-annotated lists and maps. While the paragraph above explains how to interpret un-annotated `repeated` fields, I don't want anyone to see an example here and think that it is something that should be copied. I think it is already clear enough and I would simply move on rather than drawing attention to this as a possibility.
Re: [PR] GH-3070: Add Variant logical type annotation to parquet-java [parquet-java]
wgtmac commented on PR #3072:
URL: https://github.com/apache/parquet-java/pull/3072#issuecomment-2501022328

Usually we need two reference implementations for spec changes like this. I'm not sure if there is any chance to have another implementation ready in a timely manner. IMO, at least parquet-java should support basic roundtrip read and write.
Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]
rdblue commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1859075924

## VariantEncoding.md: ##
@@ -39,13 +39,41 @@ Another motivation for the representation is that (aside from metadata) each nes
 For example, in a Variant containing an Array of Variant values, the representation of an inner Variant value, when paired with the metadata of the full variant, is itself a valid Variant.
 This document describes the Variant Binary Encoding scheme.
-[VariantShredding.md](VariantShredding.md) describes the details of the Variant shredding scheme.
+The [Variant Shredding specification](VariantShredding.md) describes the details of shredding Variant values as typed Parquet columns.
+
+## Variant in Parquet
-# Variant in Parquet
 A Variant value in Parquet is represented by a group with 2 fields, named `value` and `metadata`.
-Both fields `value` and `metadata` are of type `binary`, and cannot be `null`.
-# Metadata encoding
+* The Variant group must be annotated with the `VARIANT` logical type.
+* Both fields `value` and `metadata` must be of type `binary` (called `BYTE_ARRAY` in the Parquet thrift definition).
+* The `metadata` field is required and must be a valid Variant metadata, as defined below.
+* The `value` field is required for unshredded Variant values.
+* The `value` field is optional when parts of the Variant value are shredded according to the [Variant Shredding specification](VariantShredding.md).

Review Comment:
I've updated this to make it clear that this is referring to the repetition level. There are also examples, so I think that it is unambiguous.
Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]
rdblue commented on code in PR #461: URL: https://github.com/apache/parquet-format/pull/461#discussion_r1859077883 ## VariantShredding.md: ## @@ -25,276 +25,302 @@ The Variant type is designed to store and process semi-structured data efficiently, even with heterogeneous values. Query engines encode each Variant value in a self-describing format, and store it as a group containing `value` and `metadata` binary fields in Parquet. Since data is often partially homogenous, it can be beneficial to extract certain fields into separate Parquet columns to further improve performance. -We refer to this process as **shredding**. -Each Parquet file remains fully self-describing, with no additional metadata required to read or fully reconstruct the Variant data from the file. -Combining shredding with a binary residual provides the flexibility to represent complex, evolving data with an unbounded number of unique fields while limiting the size of file schemas, and retaining the performance benefits of a columnar format. +This process is **shredding**. -This document focuses on the shredding semantics, Parquet representation, implications for readers and writers, as well as the Variant reconstruction. -For now, it does not discuss which fields to shred, user-facing API changes, or any engine-specific considerations like how to use shredded columns. -The approach builds upon the [Variant Binary Encoding](VariantEncoding.md), and leverages the existing Parquet specification. +Shredding enables the use of of Parquet's columnar representation for more compact data encoding, the use of column statistics for data skipping, and partial projections from Parquet's columnar layout. -At a high level, we replace the `value` field of the Variant Parquet group with one or more fields called `object`, `array`, `typed_value`, and `variant_value`. -These represent a fixed schema suitable for constructing the full Variant value for each row. 
+For example, the query `SELECT variant_get(event, '$.event_ts', 'timestamp') FROM tbl` only needs to load field `event_ts`, and shredding can enable columnar projection that ignores the rest of the `event` Variant. +Similarly, for the query `SELECT * FROM tbl WHERE variant_get(event, '$.event_type', 'string') = 'signup'`, the `event_type` shredded column metadata can be used for skipping and to lazily load the rest of the Variant. -Shredding allows a query engine to reap the full benefits of Parquet's columnar representation, such as more compact data encoding, min/max statistics for data skipping, and I/O and CPU savings from pruning unnecessary fields not accessed by a query (including the non-shredded Variant binary data). -Without shredding, any query that accesses a Variant column must fetch all bytes of the full binary buffer. -With shredding, we can get nearly equivalent performance as in a relational (scalar) data model. +## Variant Metadata -For example, `select variant_get(variant_col, ‘$.field1.inner_field2’, ‘string’) from tbl` only needs to access `inner_field2`, and the file scan could avoid fetching the rest of the Variant value if this field was shredded into a separate column in the Parquet schema. -Similarly, for the query `select * from tbl where variant_get(variant_col, ‘$.id’, ‘integer’) = 123`, the scan could first decode the shredded `id` column, and only fetch/decode the full Variant value for rows that pass the filter. +Variant metadata is stored in the top-level Variant group in a binary `metadata` column regardless of whether the Variant value is shredded. -# Parquet Example +All `value` columns within the Variant must use the same `metadata`. +All field names of a Variant, whether shredded or not, must be present in the metadata. -Consider the following Parquet schema together with how Variant values might be mapped to it. 
-Notice that we represent each shredded field in `object` as a group of two fields, `typed_value` and `variant_value`. -We extract all homogenous data items of a certain path into `typed_value`, and set aside incompatible data items in `variant_value`. -Intuitively, incompatibilities within the same path may occur because we store the shredding schema per Parquet file, and each file can contain several row groups. -Selecting a type for each field that is acceptable for all rows would be impractical because it would require buffering the contents of an entire file before writing. +## Value Shredding -Typically, the expectation is that `variant_value` exists at every level as an option, along with one of `object`, `array` or `typed_value`. -If the actual Variant value contains a type that does not match the provided schema, it is stored in `variant_value`. -An `variant_value` may also be populated if an object can be partially represented: any fields that are present in the schema must be written to those fields, and any missing fields are written to `variant_value`. - -The
Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]
rdblue commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1859080002

## VariantEncoding.md: ##
@@ -39,13 +39,41 @@ Another motivation for the representation is that (aside from metadata) each nes
 For example, in a Variant containing an Array of Variant values, the representation of an inner Variant value, when paired with the metadata of the full variant, is itself a valid Variant.
 This document describes the Variant Binary Encoding scheme.
-[VariantShredding.md](VariantShredding.md) describes the details of the Variant shredding scheme.
+The [Variant Shredding specification](VariantShredding.md) describes the details of shredding Variant values as typed Parquet columns.

Review Comment:
Thanks! Updated.
Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]
rdblue commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1859083339

## VariantEncoding.md: ##
@@ -39,13 +39,41 @@ Another motivation for the representation is that (aside from metadata) each nes
 For example, in a Variant containing an Array of Variant values, the representation of an inner Variant value, when paired with the metadata of the full variant, is itself a valid Variant.
 This document describes the Variant Binary Encoding scheme.
-[VariantShredding.md](VariantShredding.md) describes the details of the Variant shredding scheme.
+The [Variant Shredding specification](VariantShredding.md) describes the details of shredding Variant values as typed Parquet columns.
+
+## Variant in Parquet
-# Variant in Parquet
 A Variant value in Parquet is represented by a group with 2 fields, named `value` and `metadata`.
-Both fields `value` and `metadata` are of type `binary`, and cannot be `null`.
-# Metadata encoding
+* The Variant group must be annotated with the `VARIANT` logical type.
+* Both fields `value` and `metadata` must be of type `binary` (called `BYTE_ARRAY` in the Parquet thrift definition).
+* The `metadata` field is required and must be a valid Variant metadata, as defined below.
+* The `value` field is required for unshredded Variant values.
+* The `value` field is optional when parts of the Variant value are shredded according to the [Variant Shredding specification](VariantShredding.md).
+* When present, the `value` field must be a valid Variant value, as defined below.
+
+This is the expected unshredded representation in Parquet:
+
+```
+optional group variant_name (VARIANT) {
+  required binary metadata;
+  required binary value;
+}
+```
+
+This is an example representation of a shredded Variant in Parquet:

Review Comment:
This already points to the shredding spec in multiple places, so I think it is clear how to get more information about `typed_value`.
Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]
rdblue commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1859084567

## VariantEncoding.md: ##
@@ -39,13 +39,41 @@ Another motivation for the representation is that (aside from metadata) each nes
 For example, in a Variant containing an Array of Variant values, the representation of an inner Variant value, when paired with the metadata of the full variant, is itself a valid Variant.
 This document describes the Variant Binary Encoding scheme.
-[VariantShredding.md](VariantShredding.md) describes the details of the Variant shredding scheme.
+The [Variant Shredding specification](VariantShredding.md) describes the details of shredding Variant values as typed Parquet columns.
+
+## Variant in Parquet
-# Variant in Parquet
 A Variant value in Parquet is represented by a group with 2 fields, named `value` and `metadata`.
-Both fields `value` and `metadata` are of type `binary`, and cannot be `null`.
-# Metadata encoding
+* The Variant group must be annotated with the `VARIANT` logical type.
+* Both fields `value` and `metadata` must be of type `binary` (called `BYTE_ARRAY` in the Parquet thrift definition).
+* The `metadata` field is required and must be a valid Variant metadata, as defined below.
+* The `value` field is required for unshredded Variant values.
+* The `value` field is optional when parts of the Variant value are shredded according to the [Variant Shredding specification](VariantShredding.md).
+* When present, the `value` field must be a valid Variant value, as defined below.
+
+This is the expected unshredded representation in Parquet:
+
+```
+optional group variant_name (VARIANT) {
+  required binary metadata;
+  required binary value;
+}
+```
+
+This is an example representation of a shredded Variant in Parquet:
+```
+optional group shredded_variant_name (VARIANT) {
+  required binary metadata;
+  optional binary value;
+  optional int64 typed_value;
+}
+```
+
+The `VARIANT` annotation places no additional restrictions on the repetition of Variant groups, but repetition may be restricted by containing types (such as `MAP` and `LIST`).

Review Comment:
I don't agree that it is considered a primitive type. And we don't need to in order to state that it places no additional restrictions on the repetition of Variant groups.
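The bullet rules in the quoted hunk can be read as a small schema check. The following is a minimal sketch of that reading, not code from parquet-format or parquet-java: `SchemaField` and `check_variant_group` are hypothetical names, the schema is modeled with plain dataclasses rather than a real Parquet library, and the presence of a `typed_value` child is assumed to be what marks a group as shredded.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SchemaField:
    """Hypothetical stand-in for a Parquet schema node (not a real library type)."""
    name: str
    repetition: str                      # "required", "optional", or "repeated"
    physical_type: Optional[str] = None  # e.g. "binary"; None for groups
    logical_type: Optional[str] = None   # e.g. "VARIANT"
    children: List["SchemaField"] = field(default_factory=list)


def check_variant_group(group: SchemaField) -> List[str]:
    """Return rule violations for a Variant group; an empty list means it passes."""
    problems: List[str] = []
    if group.logical_type != "VARIANT":
        problems.append("group must be annotated with VARIANT")

    by_name = {child.name: child for child in group.children}
    # Assumption: a typed_value child is what marks the group as shredded.
    shredded = "typed_value" in by_name

    metadata = by_name.get("metadata")
    if metadata is None or metadata.physical_type != "binary":
        problems.append("metadata must be a binary field")
    elif metadata.repetition != "required":
        problems.append("metadata must be required")

    value = by_name.get("value")
    if value is None or value.physical_type != "binary":
        problems.append("value must be a binary field")
    elif shredded and value.repetition != "optional":
        problems.append("value must be optional when shredded")
    elif not shredded and value.repetition != "required":
        problems.append("value must be required when unshredded")
    return problems


# The two schemas quoted in the hunk, expressed with the hypothetical types.
unshredded = SchemaField("variant_name", "optional", logical_type="VARIANT", children=[
    SchemaField("metadata", "required", "binary"),
    SchemaField("value", "required", "binary"),
])
shredded = SchemaField("shredded_variant_name", "optional", logical_type="VARIANT", children=[
    SchemaField("metadata", "required", "binary"),
    SchemaField("value", "optional", "binary"),
    SchemaField("typed_value", "optional", "int64"),
])
print(check_variant_group(unshredded), check_variant_group(shredded))
```

The check deliberately ignores the repetition of the group itself, in line with the statement that the `VARIANT` annotation places no additional restrictions on it.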
Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]
rdblue commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1859086423

## VariantShredding.md: ##
@@ -25,290 +25,316 @@
 The Variant type is designed to store and process semi-structured data efficiently, even with heterogeneous values.
 Query engines encode each Variant value in a self-describing format, and store it as a group containing `value` and `metadata` binary fields in Parquet.
 Since data is often partially homogenous, it can be beneficial to extract certain fields into separate Parquet columns to further improve performance.
-We refer to this process as **shredding**.
-Each Parquet file remains fully self-describing, with no additional metadata required to read or fully reconstruct the Variant data from the file.
-Combining shredding with a binary residual provides the flexibility to represent complex, evolving data with an unbounded number of unique fields while limiting the size of file schemas, and retaining the performance benefits of a columnar format.
+This process is **shredding**.
-This document focuses on the shredding semantics, Parquet representation, implications for readers and writers, as well as the Variant reconstruction.
-For now, it does not discuss which fields to shred, user-facing API changes, or any engine-specific considerations like how to use shredded columns.
-The approach builds upon the [Variant Binary Encoding](VariantEncoding.md), and leverages the existing Parquet specification.
+Shredding enables the use of Parquet's columnar representation for more compact data encoding, column statistics for data skipping, and partial projections.

Review Comment:
I think JSON makes it more confusing because these objects are not JSON and contain typed values.
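The "column statistics for data skipping" benefit named in the quoted hunk can be illustrated with a short sketch. It is illustrative only and does not reflect how any particular engine works: `RowGroupStats` and `row_groups_to_read` are hypothetical names, the min/max values are made up, and the sketch assumes every `event_type` value in the file was shredded into a string column, ignoring values that fall back to the binary `value` column. The filter mirrors the `event_type = 'signup'` query quoted elsewhere in this thread.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RowGroupStats:
    """Hypothetical min/max statistics for one row group of a shredded string column."""
    row_group: int
    min_value: str
    max_value: str


def row_groups_to_read(stats: List[RowGroupStats], wanted: str) -> List[int]:
    """Keep only the row groups whose [min, max] range could contain `wanted`."""
    return [s.row_group for s in stats if s.min_value <= wanted <= s.max_value]


# Made-up statistics for a shredded event_type column.
stats = [
    RowGroupStats(0, "click", "login"),
    RowGroupStats(1, "purchase", "signup"),
    RowGroupStats(2, "view", "view"),
]
print(row_groups_to_read(stats, "signup"))  # prints [1]
```

Only row group 1 survives the range check, so the rest of the Variant would need to be loaded for that row group alone.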
Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]
rdblue commented on code in PR #461: URL: https://github.com/apache/parquet-format/pull/461#discussion_r1859087543 ## VariantShredding.md: ## @@ -25,290 +25,316 @@ The Variant type is designed to store and process semi-structured data efficiently, even with heterogeneous values. Query engines encode each Variant value in a self-describing format, and store it as a group containing `value` and `metadata` binary fields in Parquet. Since data is often partially homogenous, it can be beneficial to extract certain fields into separate Parquet columns to further improve performance. -We refer to this process as **shredding**. -Each Parquet file remains fully self-describing, with no additional metadata required to read or fully reconstruct the Variant data from the file. -Combining shredding with a binary residual provides the flexibility to represent complex, evolving data with an unbounded number of unique fields while limiting the size of file schemas, and retaining the performance benefits of a columnar format. +This process is **shredding**. -This document focuses on the shredding semantics, Parquet representation, implications for readers and writers, as well as the Variant reconstruction. -For now, it does not discuss which fields to shred, user-facing API changes, or any engine-specific considerations like how to use shredded columns. -The approach builds upon the [Variant Binary Encoding](VariantEncoding.md), and leverages the existing Parquet specification. +Shredding enables the use of Parquet's columnar representation for more compact data encoding, column statistics for data skipping, and partial projections. -At a high level, we replace the `value` field of the Variant Parquet group with one or more fields called `object`, `array`, `typed_value`, and `variant_value`. -These represent a fixed schema suitable for constructing the full Variant value for each row. 
+For example, the query `SELECT variant_get(event, '$.event_ts', 'timestamp') FROM tbl` only needs to load field `event_ts`, and if that column is shredded, it can be read by columnar projection without reading or deserializing the rest of the `event` Variant. +Similarly, for the query `SELECT * FROM tbl WHERE variant_get(event, '$.event_type', 'string') = 'signup'`, the `event_type` shredded column metadata can be used for skipping and to lazily load the rest of the Variant. -Shredding allows a query engine to reap the full benefits of Parquet's columnar representation, such as more compact data encoding, min/max statistics for data skipping, and I/O and CPU savings from pruning unnecessary fields not accessed by a query (including the non-shredded Variant binary data). -Without shredding, any query that accesses a Variant column must fetch all bytes of the full binary buffer. -With shredding, we can get nearly equivalent performance as in a relational (scalar) data model. +## Variant Metadata -For example, `select variant_get(variant_col, ‘$.field1.inner_field2’, ‘string’) from tbl` only needs to access `inner_field2`, and the file scan could avoid fetching the rest of the Variant value if this field was shredded into a separate column in the Parquet schema. -Similarly, for the query `select * from tbl where variant_get(variant_col, ‘$.id’, ‘integer’) = 123`, the scan could first decode the shredded `id` column, and only fetch/decode the full Variant value for rows that pass the filter. +Variant metadata is stored in the top-level Variant group in a binary `metadata` column regardless of whether the Variant value is shredded. -# Parquet Example +All `value` columns within the Variant must use the same `metadata`. +All field names of a Variant, whether shredded or not, must be present in the metadata. -Consider the following Parquet schema together with how Variant values might be mapped to it. 
-Notice that we represent each shredded field in `object` as a group of two fields, `typed_value` and `variant_value`. -We extract all homogenous data items of a certain path into `typed_value`, and set aside incompatible data items in `variant_value`. -Intuitively, incompatibilities within the same path may occur because we store the shredding schema per Parquet file, and each file can contain several row groups. -Selecting a type for each field that is acceptable for all rows would be impractical because it would require buffering the contents of an entire file before writing. +## Value Shredding -Typically, the expectation is that `variant_value` exists at every level as an option, along with one of `object`, `array` or `typed_value`. -If the actual Variant value contains a type that does not match the provided schema, it is stored in `variant_value`. -An `variant_value` may also be populated if an object can be partially represented: any fields that are present in the schema must be written to those fields, and any missing fields are written to `variant_value`. - -The
Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]
rdblue commented on code in PR #461: URL: https://github.com/apache/parquet-format/pull/461#discussion_r1859093933 ## VariantShredding.md: ## @@ -25,290 +25,316 @@ The Variant type is designed to store and process semi-structured data efficiently, even with heterogeneous values. Query engines encode each Variant value in a self-describing format, and store it as a group containing `value` and `metadata` binary fields in Parquet. Since data is often partially homogenous, it can be beneficial to extract certain fields into separate Parquet columns to further improve performance. -We refer to this process as **shredding**. -Each Parquet file remains fully self-describing, with no additional metadata required to read or fully reconstruct the Variant data from the file. -Combining shredding with a binary residual provides the flexibility to represent complex, evolving data with an unbounded number of unique fields while limiting the size of file schemas, and retaining the performance benefits of a columnar format. +This process is **shredding**. -This document focuses on the shredding semantics, Parquet representation, implications for readers and writers, as well as the Variant reconstruction. -For now, it does not discuss which fields to shred, user-facing API changes, or any engine-specific considerations like how to use shredded columns. -The approach builds upon the [Variant Binary Encoding](VariantEncoding.md), and leverages the existing Parquet specification. +Shredding enables the use of Parquet's columnar representation for more compact data encoding, column statistics for data skipping, and partial projections. -At a high level, we replace the `value` field of the Variant Parquet group with one or more fields called `object`, `array`, `typed_value`, and `variant_value`. -These represent a fixed schema suitable for constructing the full Variant value for each row. 
+For example, the query `SELECT variant_get(event, '$.event_ts', 'timestamp') FROM tbl` only needs to load field `event_ts`, and if that column is shredded, it can be read by columnar projection without reading or deserializing the rest of the `event` Variant. +Similarly, for the query `SELECT * FROM tbl WHERE variant_get(event, '$.event_type', 'string') = 'signup'`, the `event_type` shredded column metadata can be used for skipping and to lazily load the rest of the Variant. -Shredding allows a query engine to reap the full benefits of Parquet's columnar representation, such as more compact data encoding, min/max statistics for data skipping, and I/O and CPU savings from pruning unnecessary fields not accessed by a query (including the non-shredded Variant binary data). -Without shredding, any query that accesses a Variant column must fetch all bytes of the full binary buffer. -With shredding, we can get nearly equivalent performance as in a relational (scalar) data model. +## Variant Metadata -For example, `select variant_get(variant_col, ‘$.field1.inner_field2’, ‘string’) from tbl` only needs to access `inner_field2`, and the file scan could avoid fetching the rest of the Variant value if this field was shredded into a separate column in the Parquet schema. -Similarly, for the query `select * from tbl where variant_get(variant_col, ‘$.id’, ‘integer’) = 123`, the scan could first decode the shredded `id` column, and only fetch/decode the full Variant value for rows that pass the filter. +Variant metadata is stored in the top-level Variant group in a binary `metadata` column regardless of whether the Variant value is shredded. -# Parquet Example +All `value` columns within the Variant must use the same `metadata`. +All field names of a Variant, whether shredded or not, must be present in the metadata. -Consider the following Parquet schema together with how Variant values might be mapped to it. 
-Notice that we represent each shredded field in `object` as a group of two fields, `typed_value` and `variant_value`. -We extract all homogenous data items of a certain path into `typed_value`, and set aside incompatible data items in `variant_value`. -Intuitively, incompatibilities within the same path may occur because we store the shredding schema per Parquet file, and each file can contain several row groups. -Selecting a type for each field that is acceptable for all rows would be impractical because it would require buffering the contents of an entire file before writing. +## Value Shredding -Typically, the expectation is that `variant_value` exists at every level as an option, along with one of `object`, `array` or `typed_value`. -If the actual Variant value contains a type that does not match the provided schema, it is stored in `variant_value`. -An `variant_value` may also be populated if an object can be partially represented: any fields that are present in the schema must be written to those fields, and any missing fields are written to `variant_value`. - -The
Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]
rdblue commented on code in PR #461: URL: https://github.com/apache/parquet-format/pull/461#discussion_r1859095957 ## VariantShredding.md: ## @@ -25,290 +25,316 @@ The Variant type is designed to store and process semi-structured data efficiently, even with heterogeneous values. Query engines encode each Variant value in a self-describing format, and store it as a group containing `value` and `metadata` binary fields in Parquet. Since data is often partially homogenous, it can be beneficial to extract certain fields into separate Parquet columns to further improve performance. -We refer to this process as **shredding**. -Each Parquet file remains fully self-describing, with no additional metadata required to read or fully reconstruct the Variant data from the file. -Combining shredding with a binary residual provides the flexibility to represent complex, evolving data with an unbounded number of unique fields while limiting the size of file schemas, and retaining the performance benefits of a columnar format. +This process is **shredding**. -This document focuses on the shredding semantics, Parquet representation, implications for readers and writers, as well as the Variant reconstruction. -For now, it does not discuss which fields to shred, user-facing API changes, or any engine-specific considerations like how to use shredded columns. -The approach builds upon the [Variant Binary Encoding](VariantEncoding.md), and leverages the existing Parquet specification. +Shredding enables the use of Parquet's columnar representation for more compact data encoding, column statistics for data skipping, and partial projections. -At a high level, we replace the `value` field of the Variant Parquet group with one or more fields called `object`, `array`, `typed_value`, and `variant_value`. -These represent a fixed schema suitable for constructing the full Variant value for each row. 
+For example, the query `SELECT variant_get(event, '$.event_ts', 'timestamp') FROM tbl` only needs to load field `event_ts`, and if that column is shredded, it can be read by columnar projection without reading or deserializing the rest of the `event` Variant. +Similarly, for the query `SELECT * FROM tbl WHERE variant_get(event, '$.event_type', 'string') = 'signup'`, the `event_type` shredded column metadata can be used for skipping and to lazily load the rest of the Variant. -Shredding allows a query engine to reap the full benefits of Parquet's columnar representation, such as more compact data encoding, min/max statistics for data skipping, and I/O and CPU savings from pruning unnecessary fields not accessed by a query (including the non-shredded Variant binary data). -Without shredding, any query that accesses a Variant column must fetch all bytes of the full binary buffer. -With shredding, we can get nearly equivalent performance as in a relational (scalar) data model. +## Variant Metadata -For example, `select variant_get(variant_col, ‘$.field1.inner_field2’, ‘string’) from tbl` only needs to access `inner_field2`, and the file scan could avoid fetching the rest of the Variant value if this field was shredded into a separate column in the Parquet schema. -Similarly, for the query `select * from tbl where variant_get(variant_col, ‘$.id’, ‘integer’) = 123`, the scan could first decode the shredded `id` column, and only fetch/decode the full Variant value for rows that pass the filter. +Variant metadata is stored in the top-level Variant group in a binary `metadata` column regardless of whether the Variant value is shredded. -# Parquet Example +All `value` columns within the Variant must use the same `metadata`. +All field names of a Variant, whether shredded or not, must be present in the metadata. -Consider the following Parquet schema together with how Variant values might be mapped to it. 
-Notice that we represent each shredded field in `object` as a group of two fields, `typed_value` and `variant_value`. -We extract all homogenous data items of a certain path into `typed_value`, and set aside incompatible data items in `variant_value`. -Intuitively, incompatibilities within the same path may occur because we store the shredding schema per Parquet file, and each file can contain several row groups. -Selecting a type for each field that is acceptable for all rows would be impractical because it would require buffering the contents of an entire file before writing. +## Value Shredding -Typically, the expectation is that `variant_value` exists at every level as an option, along with one of `object`, `array` or `typed_value`. -If the actual Variant value contains a type that does not match the provided schema, it is stored in `variant_value`. -An `variant_value` may also be populated if an object can be partially represented: any fields that are present in the schema must be written to those fields, and any missing fields are written to `variant_value`. - -The
Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]
rdblue commented on code in PR #461: URL: https://github.com/apache/parquet-format/pull/461#discussion_r1859099929 ## VariantShredding.md: ##
Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]
rdblue commented on code in PR #461: URL: https://github.com/apache/parquet-format/pull/461#discussion_r1859092517 ## VariantShredding.md: ##
Re: [PR] MINOR: Use `exec-maven-plugin.version` property [parquet-java]
Fokko merged PR #3047: URL: https://github.com/apache/parquet-java/pull/3047
Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]
rdblue commented on code in PR #461: URL: https://github.com/apache/parquet-format/pull/461#discussion_r1859141628 ## VariantShredding.md: ##
Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]
rdblue commented on code in PR #461: URL: https://github.com/apache/parquet-format/pull/461#discussion_r1859148222 ## VariantShredding.md: ##
[PR] MINOR: Revert `buildnumber-maven-plugin` to 3.2.0 [parquet-java]
Fokko opened a new pull request, #3082: URL: https://github.com/apache/parquet-java/pull/3082 ### Rationale for this change During verification of the 1.15.0 release, @gszadovszky noticed that this specific version caused issues, therefore it is better to revert it for now. ### What changes are included in this PR? ### Are these changes tested? ### Are there any user-facing changes?
Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]
rdblue commented on code in PR #461: URL: https://github.com/apache/parquet-format/pull/461#discussion_r1859127325 ## VariantShredding.md: ##
Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]
rdblue commented on code in PR #461: URL: https://github.com/apache/parquet-format/pull/461#discussion_r1859130304 ## VariantShredding.md: ##
Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]
rdblue commented on code in PR #461: URL: https://github.com/apache/parquet-format/pull/461#discussion_r1859147187 ## VariantShredding.md: ## @@ -25,290 +25,316 @@ The Variant type is designed to store and process semi-structured data efficiently, even with heterogeneous values. Query engines encode each Variant value in a self-describing format, and store it as a group containing `value` and `metadata` binary fields in Parquet. Since data is often partially homogenous, it can be beneficial to extract certain fields into separate Parquet columns to further improve performance. -We refer to this process as **shredding**. -Each Parquet file remains fully self-describing, with no additional metadata required to read or fully reconstruct the Variant data from the file. -Combining shredding with a binary residual provides the flexibility to represent complex, evolving data with an unbounded number of unique fields while limiting the size of file schemas, and retaining the performance benefits of a columnar format. +This process is **shredding**. -This document focuses on the shredding semantics, Parquet representation, implications for readers and writers, as well as the Variant reconstruction. -For now, it does not discuss which fields to shred, user-facing API changes, or any engine-specific considerations like how to use shredded columns. -The approach builds upon the [Variant Binary Encoding](VariantEncoding.md), and leverages the existing Parquet specification. +Shredding enables the use of Parquet's columnar representation for more compact data encoding, column statistics for data skipping, and partial projections. -At a high level, we replace the `value` field of the Variant Parquet group with one or more fields called `object`, `array`, `typed_value`, and `variant_value`. -These represent a fixed schema suitable for constructing the full Variant value for each row. 
+For example, the query `SELECT variant_get(event, '$.event_ts', 'timestamp') FROM tbl` only needs to load field `event_ts`, and if that column is shredded, it can be read by columnar projection without reading or deserializing the rest of the `event` Variant. +Similarly, for the query `SELECT * FROM tbl WHERE variant_get(event, '$.event_type', 'string') = 'signup'`, the `event_type` shredded column metadata can be used for skipping and to lazily load the rest of the Variant. -Shredding allows a query engine to reap the full benefits of Parquet's columnar representation, such as more compact data encoding, min/max statistics for data skipping, and I/O and CPU savings from pruning unnecessary fields not accessed by a query (including the non-shredded Variant binary data). -Without shredding, any query that accesses a Variant column must fetch all bytes of the full binary buffer. -With shredding, we can get nearly equivalent performance as in a relational (scalar) data model. +## Variant Metadata -For example, `select variant_get(variant_col, ‘$.field1.inner_field2’, ‘string’) from tbl` only needs to access `inner_field2`, and the file scan could avoid fetching the rest of the Variant value if this field was shredded into a separate column in the Parquet schema. -Similarly, for the query `select * from tbl where variant_get(variant_col, ‘$.id’, ‘integer’) = 123`, the scan could first decode the shredded `id` column, and only fetch/decode the full Variant value for rows that pass the filter. +Variant metadata is stored in the top-level Variant group in a binary `metadata` column regardless of whether the Variant value is shredded. -# Parquet Example +All `value` columns within the Variant must use the same `metadata`. +All field names of a Variant, whether shredded or not, must be present in the metadata. -Consider the following Parquet schema together with how Variant values might be mapped to it. 
-Notice that we represent each shredded field in `object` as a group of two fields, `typed_value` and `variant_value`. -We extract all homogenous data items of a certain path into `typed_value`, and set aside incompatible data items in `variant_value`. -Intuitively, incompatibilities within the same path may occur because we store the shredding schema per Parquet file, and each file can contain several row groups. -Selecting a type for each field that is acceptable for all rows would be impractical because it would require buffering the contents of an entire file before writing. +## Value Shredding -Typically, the expectation is that `variant_value` exists at every level as an option, along with one of `object`, `array` or `typed_value`. -If the actual Variant value contains a type that does not match the provided schema, it is stored in `variant_value`. -An `variant_value` may also be populated if an object can be partially represented: any fields that are present in the schema must be written to those fields, and any missing fields are written to `variant_value`. - -The
Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]
rdblue commented on code in PR #461: URL: https://github.com/apache/parquet-format/pull/461#discussion_r1859151894 ## VariantEncoding.md: ## @@ -416,14 +444,36 @@ Field names are case-sensitive. Field names are required to be unique for each object. It is an error for an object to contain two fields with the same name, whether or not they have distinct dictionary IDs. -# Versions and extensions +## Versions and extensions An implementation is not expected to parse a Variant value whose metadata version is higher than the version supported by the implementation. However, new types may be added to the specification without incrementing the version ID. In such a situation, an implementation should be able to read the rest of the Variant value if desired. -# Shredding +## Shredding A single Variant object may have poor read performance when only a small subset of fields are needed. A better approach is to create separate columns for individual fields, referred to as shredding or subcolumnarization. [VariantShredding.md](VariantShredding.md) describes the Variant shredding specification in Parquet. + +## Conversion to JSON + +Values stored in the Variant encoding are a superset of JSON values. +For example, a Variant value can be a date that has no equivalent type in JSON. +To maximize compatibility with readers that can process JSON but not Variant, the following conversions should be used when producing JSON from a Variant: + +| Variant type | JSON type | Representation requirements | Example | +|---|---|--|--| +| Null type | null | `null` | `null` | +| Boolean | boolean | `true` or `false` | `true` | +| Exact Numeric | number| Digits in fraction must match scale, no exponent | `34`, 34.00 | Review Comment: > When an engine wants to convert a variant value to a JSON string, here are the rules Yes, this is correct. We want a clear way to convert to a JSON string. However, the normalization needs to happen first. 
We don't want to specify that the JSON must be any more lossy than it already is. Why would we require an engine to produce a normalized value?
[I] HadoopStreams to support ByteBufferPositionedReadable input streams [parquet-java]
steveloughran opened a new issue, #3080: URL: https://github.com/apache/parquet-java/issues/3080

### Describe the enhancement requested

If a stream declares in its StreamCapabilities that it supports `ByteBufferPositionedReadable`, then use it for `readFully(ByteBuffer)`. All streams in Hadoop 3.0.0+ do declare this.

Also use StreamCapabilities to look for `ByteBufferReadable`: use this probe first, falling back to the recursive scan. All streams in the Hadoop codebase will report this via StreamCapabilities, but there may be some third-party streams which do not.

### Component(s)

_No response_
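The probe-then-fallback detection described in this issue might look roughly like the sketch below. It is only an illustration under stated assumptions: `StreamCapabilities` and `ByteBufferReadable` here are local stand-ins rather than the Hadoop interfaces, `WrappingStream`, `DeclaringStream` and `LegacyStream` are hypothetical test classes, and the capability string is assumed for the example. It is not the code from the eventual parquet-java change.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.InputStream;

public class CapabilityProbeSketch {
  // Capability name assumed for illustration; Hadoop defines its own constants.
  static final String READ_BYTE_BUFFER = "in:readbytebuffer";

  /** Local stand-in for a capability-probe interface (assumed shape: one method). */
  interface StreamCapabilities {
    boolean hasCapability(String capability);
  }

  /** Local stand-in marker for streams that can read into a ByteBuffer. */
  interface ByteBufferReadable {}

  /** Hypothetical wrapper that exposes the stream it wraps. */
  static class WrappingStream extends FilterInputStream {
    WrappingStream(InputStream in) { super(in); }
    InputStream getWrappedStream() { return in; }
  }

  /** A stream that declares the capability explicitly. */
  static class DeclaringStream extends ByteArrayInputStream implements StreamCapabilities {
    DeclaringStream() { super(new byte[0]); }
    @Override public boolean hasCapability(String capability) {
      return READ_BYTE_BUFFER.equals(capability);
    }
  }

  /** A third-party style stream: implements the interface but declares nothing. */
  static class LegacyStream extends ByteArrayInputStream implements ByteBufferReadable {
    LegacyStream() { super(new byte[0]); }
  }

  static boolean isByteBufferReadable(InputStream stream) {
    // Step 1: cheap probe, trusted when the stream declares the capability.
    if (stream instanceof StreamCapabilities
        && ((StreamCapabilities) stream).hasCapability(READ_BYTE_BUFFER)) {
      return true;
    }
    // Step 2: fallback scan, unwrapping wrappers until the interface is found
    // or there is nothing left to unwrap (iterative form of a recursive scan).
    InputStream current = stream;
    while (current != null) {
      if (current instanceof ByteBufferReadable) {
        return true;
      }
      current = (current instanceof WrappingStream)
          ? ((WrappingStream) current).getWrappedStream()
          : null;
    }
    return false;
  }

  public static void main(String[] args) {
    System.out.println(isByteBufferReadable(new DeclaringStream()));
    System.out.println(isByteBufferReadable(new WrappingStream(new LegacyStream())));
    System.out.println(isByteBufferReadable(new ByteArrayInputStream(new byte[0])));
  }
}
```

Checking the declared capability first keeps the common case cheap; the scan only matters for streams that implement the interface without reporting it.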
Re: [I] HadoopStreams to support ByteBufferPositionedReadable input streams [parquet-java]
steveloughran commented on issue #3080: URL: https://github.com/apache/parquet-java/issues/3080#issuecomment-2501825209 I'm implementing this, with tests.
Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]
rdblue commented on code in PR #461: URL: https://github.com/apache/parquet-format/pull/461#discussion_r1859139239 ## VariantShredding.md: ## @@ -25,290 +25,316 @@ The Variant type is designed to store and process semi-structured data efficiently, even with heterogeneous values. Query engines encode each Variant value in a self-describing format, and store it as a group containing `value` and `metadata` binary fields in Parquet. Since data is often partially homogenous, it can be beneficial to extract certain fields into separate Parquet columns to further improve performance. -We refer to this process as **shredding**. -Each Parquet file remains fully self-describing, with no additional metadata required to read or fully reconstruct the Variant data from the file. -Combining shredding with a binary residual provides the flexibility to represent complex, evolving data with an unbounded number of unique fields while limiting the size of file schemas, and retaining the performance benefits of a columnar format. +This process is **shredding**. -This document focuses on the shredding semantics, Parquet representation, implications for readers and writers, as well as the Variant reconstruction. -For now, it does not discuss which fields to shred, user-facing API changes, or any engine-specific considerations like how to use shredded columns. -The approach builds upon the [Variant Binary Encoding](VariantEncoding.md), and leverages the existing Parquet specification. +Shredding enables the use of Parquet's columnar representation for more compact data encoding, column statistics for data skipping, and partial projections. -At a high level, we replace the `value` field of the Variant Parquet group with one or more fields called `object`, `array`, `typed_value`, and `variant_value`. -These represent a fixed schema suitable for constructing the full Variant value for each row. 
+For example, the query `SELECT variant_get(event, '$.event_ts', 'timestamp') FROM tbl` only needs to load field `event_ts`, and if that column is shredded, it can be read by columnar projection without reading or deserializing the rest of the `event` Variant. +Similarly, for the query `SELECT * FROM tbl WHERE variant_get(event, '$.event_type', 'string') = 'signup'`, the `event_type` shredded column metadata can be used for skipping and to lazily load the rest of the Variant. -Shredding allows a query engine to reap the full benefits of Parquet's columnar representation, such as more compact data encoding, min/max statistics for data skipping, and I/O and CPU savings from pruning unnecessary fields not accessed by a query (including the non-shredded Variant binary data). -Without shredding, any query that accesses a Variant column must fetch all bytes of the full binary buffer. -With shredding, we can get nearly equivalent performance as in a relational (scalar) data model. +## Variant Metadata -For example, `select variant_get(variant_col, ‘$.field1.inner_field2’, ‘string’) from tbl` only needs to access `inner_field2`, and the file scan could avoid fetching the rest of the Variant value if this field was shredded into a separate column in the Parquet schema. -Similarly, for the query `select * from tbl where variant_get(variant_col, ‘$.id’, ‘integer’) = 123`, the scan could first decode the shredded `id` column, and only fetch/decode the full Variant value for rows that pass the filter. +Variant metadata is stored in the top-level Variant group in a binary `metadata` column regardless of whether the Variant value is shredded. -# Parquet Example +All `value` columns within the Variant must use the same `metadata`. +All field names of a Variant, whether shredded or not, must be present in the metadata. -Consider the following Parquet schema together with how Variant values might be mapped to it. 
-Notice that we represent each shredded field in `object` as a group of two fields, `typed_value` and `variant_value`. -We extract all homogenous data items of a certain path into `typed_value`, and set aside incompatible data items in `variant_value`. -Intuitively, incompatibilities within the same path may occur because we store the shredding schema per Parquet file, and each file can contain several row groups. -Selecting a type for each field that is acceptable for all rows would be impractical because it would require buffering the contents of an entire file before writing. +## Value Shredding -Typically, the expectation is that `variant_value` exists at every level as an option, along with one of `object`, `array` or `typed_value`. -If the actual Variant value contains a type that does not match the provided schema, it is stored in `variant_value`. -An `variant_value` may also be populated if an object can be partially represented: any fields that are present in the schema must be written to those fields, and any missing fields are written to `variant_value`. - -The
Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]
rdblue commented on code in PR #461: URL: https://github.com/apache/parquet-format/pull/461#discussion_r1859143649 ## VariantShredding.md: ## @@ -25,290 +25,316 @@ The Variant type is designed to store and process semi-structured data efficiently, even with heterogeneous values. Query engines encode each Variant value in a self-describing format, and store it as a group containing `value` and `metadata` binary fields in Parquet. Since data is often partially homogenous, it can be beneficial to extract certain fields into separate Parquet columns to further improve performance. -We refer to this process as **shredding**. -Each Parquet file remains fully self-describing, with no additional metadata required to read or fully reconstruct the Variant data from the file. -Combining shredding with a binary residual provides the flexibility to represent complex, evolving data with an unbounded number of unique fields while limiting the size of file schemas, and retaining the performance benefits of a columnar format. +This process is **shredding**. -This document focuses on the shredding semantics, Parquet representation, implications for readers and writers, as well as the Variant reconstruction. -For now, it does not discuss which fields to shred, user-facing API changes, or any engine-specific considerations like how to use shredded columns. -The approach builds upon the [Variant Binary Encoding](VariantEncoding.md), and leverages the existing Parquet specification. +Shredding enables the use of Parquet's columnar representation for more compact data encoding, column statistics for data skipping, and partial projections. -At a high level, we replace the `value` field of the Variant Parquet group with one or more fields called `object`, `array`, `typed_value`, and `variant_value`. -These represent a fixed schema suitable for constructing the full Variant value for each row. 
+For example, the query `SELECT variant_get(event, '$.event_ts', 'timestamp') FROM tbl` only needs to load field `event_ts`, and if that column is shredded, it can be read by columnar projection without reading or deserializing the rest of the `event` Variant. +Similarly, for the query `SELECT * FROM tbl WHERE variant_get(event, '$.event_type', 'string') = 'signup'`, the `event_type` shredded column metadata can be used for skipping and to lazily load the rest of the Variant. -Shredding allows a query engine to reap the full benefits of Parquet's columnar representation, such as more compact data encoding, min/max statistics for data skipping, and I/O and CPU savings from pruning unnecessary fields not accessed by a query (including the non-shredded Variant binary data). -Without shredding, any query that accesses a Variant column must fetch all bytes of the full binary buffer. -With shredding, we can get nearly equivalent performance as in a relational (scalar) data model. +## Variant Metadata -For example, `select variant_get(variant_col, ‘$.field1.inner_field2’, ‘string’) from tbl` only needs to access `inner_field2`, and the file scan could avoid fetching the rest of the Variant value if this field was shredded into a separate column in the Parquet schema. -Similarly, for the query `select * from tbl where variant_get(variant_col, ‘$.id’, ‘integer’) = 123`, the scan could first decode the shredded `id` column, and only fetch/decode the full Variant value for rows that pass the filter. +Variant metadata is stored in the top-level Variant group in a binary `metadata` column regardless of whether the Variant value is shredded. -# Parquet Example +All `value` columns within the Variant must use the same `metadata`. +All field names of a Variant, whether shredded or not, must be present in the metadata. -Consider the following Parquet schema together with how Variant values might be mapped to it. 
-Notice that we represent each shredded field in `object` as a group of two fields, `typed_value` and `variant_value`. -We extract all homogenous data items of a certain path into `typed_value`, and set aside incompatible data items in `variant_value`. -Intuitively, incompatibilities within the same path may occur because we store the shredding schema per Parquet file, and each file can contain several row groups. -Selecting a type for each field that is acceptable for all rows would be impractical because it would require buffering the contents of an entire file before writing. +## Value Shredding -Typically, the expectation is that `variant_value` exists at every level as an option, along with one of `object`, `array` or `typed_value`. -If the actual Variant value contains a type that does not match the provided schema, it is stored in `variant_value`. -An `variant_value` may also be populated if an object can be partially represented: any fields that are present in the schema must be written to those fields, and any missing fields are written to `variant_value`. - -The
[PR] MINOR: Add shading for JDK22 specific classes [parquet-java]
Fokko opened a new pull request, #3081: URL: https://github.com/apache/parquet-java/pull/3081 ### Rationale for this change JDK 22 specific classes were added in Jackson, but we forgot to shade them explicitly as pointed out in: https://github.com/apache/parquet-java/blob/8fa70320a9cdeeba12a4d17ef248cd4e535f0907/pom.xml#L70 ### What changes are included in this PR? ### Are these changes tested? ### Are there any user-facing changes?
Re: [PR] GH-3070: Add Variant logical type annotation to parquet-java [parquet-java]
aihuaxu commented on PR #3072: URL: https://github.com/apache/parquet-java/pull/3072#issuecomment-2501372540 I see. Per the guideline, we need to have the implementation in parquet-java and then another one. Do we usually include the implementation with this annotation change, or should it be separate?

> Completeness: The goal of this phase is to ensure the feature is viable, there is no ambiguity in its specification by demonstrating compatibility between implementations. Once a change has lazy consensus, two implementations of the feature demonstrating interoperability must also be provided. One implementation MUST be [parquet-java](http://github.com/apache/parquet-java). It is preferred that the second implementation be [parquet-cpp](https://github.com/apache/arrow) or [parquet-rs](https://github.com/apache/arrow-rs),
[PR] GH-3078: Use Hadoop FileSystem.openFile() to open files [parquet-java]
steveloughran opened a new pull request, #3079: URL: https://github.com/apache/parquet-java/pull/3079 ### Rationale for this change ### What changes are included in this PR? * Open files with FileSystem.openFile(), passing in file status * And read policy of "parquet, vector, random, adaptive" ### Are these changes tested? Through parquet-hadoop. ### Are there any user-facing changes? no. Closes #3078
Re: [PR] GH-2943: Remove hadoop-2 support [parquet-java]
Fokko merged PR #3061: URL: https://github.com/apache/parquet-java/pull/3061
Re: [I] Remove support for Hadoop <3.3 [parquet-java]
Fokko closed issue #2943: Remove support for Hadoop <3.3 URL: https://github.com/apache/parquet-java/issues/2943
Re: [PR] HadoopInputFile to pass down FileStatus when opening file [parquet-java]
steveloughran closed pull request #2955: HadoopInputFile to pass down FileStatus when opening file URL: https://github.com/apache/parquet-java/pull/2955
Re: [PR] HadoopInputFile to pass down FileStatus when opening file [parquet-java]
steveloughran commented on PR #2955: URL: https://github.com/apache/parquet-java/pull/2955#issuecomment-2501251041 Superseded by #3079 now that reflection is not needed.
Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]
rdblue commented on code in PR #461: URL: https://github.com/apache/parquet-format/pull/461#discussion_r1859059592 ## VariantShredding.md: ## @@ -25,276 +25,302 @@ The Variant type is designed to store and process semi-structured data efficiently, even with heterogeneous values. Query engines encode each Variant value in a self-describing format, and store it as a group containing `value` and `metadata` binary fields in Parquet. Since data is often partially homogenous, it can be beneficial to extract certain fields into separate Parquet columns to further improve performance. -We refer to this process as **shredding**. -Each Parquet file remains fully self-describing, with no additional metadata required to read or fully reconstruct the Variant data from the file. -Combining shredding with a binary residual provides the flexibility to represent complex, evolving data with an unbounded number of unique fields while limiting the size of file schemas, and retaining the performance benefits of a columnar format. +This process is **shredding**. -This document focuses on the shredding semantics, Parquet representation, implications for readers and writers, as well as the Variant reconstruction. -For now, it does not discuss which fields to shred, user-facing API changes, or any engine-specific considerations like how to use shredded columns. -The approach builds upon the [Variant Binary Encoding](VariantEncoding.md), and leverages the existing Parquet specification. +Shredding enables the use of of Parquet's columnar representation for more compact data encoding, the use of column statistics for data skipping, and partial projections from Parquet's columnar layout. -At a high level, we replace the `value` field of the Variant Parquet group with one or more fields called `object`, `array`, `typed_value`, and `variant_value`. -These represent a fixed schema suitable for constructing the full Variant value for each row. 
+For example, the query `SELECT variant_get(event, '$.event_ts', 'timestamp') FROM tbl` only needs to load field `event_ts`, and shredding can enable columnar projection that ignores the rest of the `event` Variant. +Similarly, for the query `SELECT * FROM tbl WHERE variant_get(event, '$.event_type', 'string') = 'signup'`, the `event_type` shredded column metadata can be used for skipping and to lazily load the rest of the Variant. -Shredding allows a query engine to reap the full benefits of Parquet's columnar representation, such as more compact data encoding, min/max statistics for data skipping, and I/O and CPU savings from pruning unnecessary fields not accessed by a query (including the non-shredded Variant binary data). -Without shredding, any query that accesses a Variant column must fetch all bytes of the full binary buffer. -With shredding, we can get nearly equivalent performance as in a relational (scalar) data model. +## Variant Metadata -For example, `select variant_get(variant_col, ‘$.field1.inner_field2’, ‘string’) from tbl` only needs to access `inner_field2`, and the file scan could avoid fetching the rest of the Variant value if this field was shredded into a separate column in the Parquet schema. -Similarly, for the query `select * from tbl where variant_get(variant_col, ‘$.id’, ‘integer’) = 123`, the scan could first decode the shredded `id` column, and only fetch/decode the full Variant value for rows that pass the filter. +Variant metadata is stored in the top-level Variant group in a binary `metadata` column regardless of whether the Variant value is shredded. -# Parquet Example +All `value` columns within the Variant must use the same `metadata`. +All field names of a Variant, whether shredded or not, must be present in the metadata. -Consider the following Parquet schema together with how Variant values might be mapped to it. 
-Notice that we represent each shredded field in `object` as a group of two fields, `typed_value` and `variant_value`. -We extract all homogenous data items of a certain path into `typed_value`, and set aside incompatible data items in `variant_value`. -Intuitively, incompatibilities within the same path may occur because we store the shredding schema per Parquet file, and each file can contain several row groups. -Selecting a type for each field that is acceptable for all rows would be impractical because it would require buffering the contents of an entire file before writing. +## Value Shredding -Typically, the expectation is that `variant_value` exists at every level as an option, along with one of `object`, `array` or `typed_value`. -If the actual Variant value contains a type that does not match the provided schema, it is stored in `variant_value`. -An `variant_value` may also be populated if an object can be partially represented: any fields that are present in the schema must be written to those fields, and any missing fields are written to `variant_value`. - -The
Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]
rdblue commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1859061998

## VariantEncoding.md: ##
@@ -416,14 +444,36 @@ Field names are case-sensitive. Field names are required to be unique for each object. It is an error for an object to contain two fields with the same name, whether or not they have distinct dictionary IDs.
-# Versions and extensions
+## Versions and extensions
 An implementation is not expected to parse a Variant value whose metadata version is higher than the version supported by the implementation. However, new types may be added to the specification without incrementing the version ID. In such a situation, an implementation should be able to read the rest of the Variant value if desired.
-# Shredding
+## Shredding
 A single Variant object may have poor read performance when only a small subset of fields are needed. A better approach is to create separate columns for individual fields, referred to as shredding or subcolumnarization. [VariantShredding.md](VariantShredding.md) describes the Variant shredding specification in Parquet.
+
+## Conversion to JSON
+
+Values stored in the Variant encoding are a superset of JSON values.
+For example, a Variant value can be a date that has no equivalent type in JSON.
+To maximize compatibility with readers that can process JSON but not Variant, the following conversions should be used when producing JSON from a Variant:
+
+| Variant type | JSON type | Representation requirements | Example |
+|---|---|--|--|
+| Null type | null | `null` | `null` |
+| Boolean | boolean | `true` or `false` | `true` |
+| Exact Numeric | number| Digits in fraction must match scale, no exponent | `34`, 34.00 |
+| Float | number| Fraction must be present | `14.20` |
+| Double| number| Fraction must be present | `1.0`|
+| Date | string| ISO-8601 formatted date | `"2017-11-16"` |
+| Timestamp | string| ISO-8601 formatted UTC timestamp including +00:00 offset | `"2017-11-16T22:31:08.01+00:00"` |
+| TimestampNTZ | string| ISO-8601 formatted UTC timestamp with no offset or zone | `"2017-11-16T22:31:08.01"` |

Review Comment:
In that case, I'll require trailing 0s.
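For readers following the conversion table quoted above, here is a minimal Python sketch of how the listed rows might be rendered as JSON text. It is not taken from the PR. The mapping from Python types to Variant types (`Decimal` for Exact Numeric, `float` for Float/Double, timezone-aware `datetime` for Timestamp, naive `datetime` for TimestampNTZ) and the function name `variant_scalar_to_json` are assumptions made for illustration. Note that `isoformat()` emits six fractional-second digits when microseconds are non-zero, which differs from the two-digit example in the table.

```python
import json
import math
from datetime import date, datetime, timezone
from decimal import Decimal


def variant_scalar_to_json(value):
    """Render one scalar as JSON text (hypothetical helper, see assumptions above)."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # bool must be tested before int
        return "true" if value else "false"
    if isinstance(value, int):
        return str(value)
    if isinstance(value, Decimal):
        # Fixed-point formatting: no exponent, and trailing zeros are kept,
        # so the number of fraction digits matches the Decimal's scale.
        return format(value, "f")
    if isinstance(value, float):
        if not math.isfinite(value):
            raise ValueError("NaN and infinities have no JSON number form")
        mantissa, sep, exponent = repr(value).partition("e")
        if "." not in mantissa:
            mantissa += ".0"  # make sure a fraction is present
        return mantissa + sep + exponent
    if isinstance(value, datetime):  # datetime must be tested before date
        if value.tzinfo is not None:
            # Timestamp: normalize to UTC so the offset is always +00:00.
            value = value.astimezone(timezone.utc)
        # TimestampNTZ (naive datetime) is emitted with no offset or zone.
        return json.dumps(value.isoformat())
    if isinstance(value, date):
        return json.dumps(value.isoformat())
    raise TypeError(f"type not covered by this sketch: {type(value).__name__}")


print(variant_scalar_to_json(Decimal("34.00")))
print(variant_scalar_to_json(date(2017, 11, 16)))
```

Using `Decimal` rather than `float` for exact numerics is what lets the sketch keep trailing zeros, which is the point of the review comment above.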
Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]
rdblue commented on code in PR #461: URL: https://github.com/apache/parquet-format/pull/461#discussion_r1859071155 ## VariantShredding.md: ## @@ -25,276 +25,302 @@ The Variant type is designed to store and process semi-structured data efficiently, even with heterogeneous values. Query engines encode each Variant value in a self-describing format, and store it as a group containing `value` and `metadata` binary fields in Parquet. Since data is often partially homogenous, it can be beneficial to extract certain fields into separate Parquet columns to further improve performance. -We refer to this process as **shredding**. -Each Parquet file remains fully self-describing, with no additional metadata required to read or fully reconstruct the Variant data from the file. -Combining shredding with a binary residual provides the flexibility to represent complex, evolving data with an unbounded number of unique fields while limiting the size of file schemas, and retaining the performance benefits of a columnar format. +This process is **shredding**. -This document focuses on the shredding semantics, Parquet representation, implications for readers and writers, as well as the Variant reconstruction. -For now, it does not discuss which fields to shred, user-facing API changes, or any engine-specific considerations like how to use shredded columns. -The approach builds upon the [Variant Binary Encoding](VariantEncoding.md), and leverages the existing Parquet specification. +Shredding enables the use of of Parquet's columnar representation for more compact data encoding, the use of column statistics for data skipping, and partial projections from Parquet's columnar layout. -At a high level, we replace the `value` field of the Variant Parquet group with one or more fields called `object`, `array`, `typed_value`, and `variant_value`. -These represent a fixed schema suitable for constructing the full Variant value for each row. 
+For example, the query `SELECT variant_get(event, '$.event_ts', 'timestamp') FROM tbl` only needs to load field `event_ts`, and shredding can enable columnar projection that ignores the rest of the `event` Variant. +Similarly, for the query `SELECT * FROM tbl WHERE variant_get(event, '$.event_type', 'string') = 'signup'`, the `event_type` shredded column metadata can be used for skipping and to lazily load the rest of the Variant. -Shredding allows a query engine to reap the full benefits of Parquet's columnar representation, such as more compact data encoding, min/max statistics for data skipping, and I/O and CPU savings from pruning unnecessary fields not accessed by a query (including the non-shredded Variant binary data). -Without shredding, any query that accesses a Variant column must fetch all bytes of the full binary buffer. -With shredding, we can get nearly equivalent performance as in a relational (scalar) data model. +## Variant Metadata -For example, `select variant_get(variant_col, ‘$.field1.inner_field2’, ‘string’) from tbl` only needs to access `inner_field2`, and the file scan could avoid fetching the rest of the Variant value if this field was shredded into a separate column in the Parquet schema. -Similarly, for the query `select * from tbl where variant_get(variant_col, ‘$.id’, ‘integer’) = 123`, the scan could first decode the shredded `id` column, and only fetch/decode the full Variant value for rows that pass the filter. +Variant metadata is stored in the top-level Variant group in a binary `metadata` column regardless of whether the Variant value is shredded. -# Parquet Example +All `value` columns within the Variant must use the same `metadata`. +All field names of a Variant, whether shredded or not, must be present in the metadata. -Consider the following Parquet schema together with how Variant values might be mapped to it. 
-Notice that we represent each shredded field in `object` as a group of two fields, `typed_value` and `variant_value`. -We extract all homogenous data items of a certain path into `typed_value`, and set aside incompatible data items in `variant_value`. -Intuitively, incompatibilities within the same path may occur because we store the shredding schema per Parquet file, and each file can contain several row groups. -Selecting a type for each field that is acceptable for all rows would be impractical because it would require buffering the contents of an entire file before writing. +## Value Shredding -Typically, the expectation is that `variant_value` exists at every level as an option, along with one of `object`, `array` or `typed_value`. -If the actual Variant value contains a type that does not match the provided schema, it is stored in `variant_value`. -An `variant_value` may also be populated if an object can be partially represented: any fields that are present in the schema must be written to those fields, and any missing fields are written to `variant_value`. - -The
Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]
rdblue commented on code in PR #461: URL: https://github.com/apache/parquet-format/pull/461#discussion_r1859108674 ## VariantShredding.md: ## @@ -25,290 +25,316 @@ The Variant type is designed to store and process semi-structured data efficiently, even with heterogeneous values. Query engines encode each Variant value in a self-describing format, and store it as a group containing `value` and `metadata` binary fields in Parquet. Since data is often partially homogenous, it can be beneficial to extract certain fields into separate Parquet columns to further improve performance. -We refer to this process as **shredding**. -Each Parquet file remains fully self-describing, with no additional metadata required to read or fully reconstruct the Variant data from the file. -Combining shredding with a binary residual provides the flexibility to represent complex, evolving data with an unbounded number of unique fields while limiting the size of file schemas, and retaining the performance benefits of a columnar format. +This process is **shredding**. -This document focuses on the shredding semantics, Parquet representation, implications for readers and writers, as well as the Variant reconstruction. -For now, it does not discuss which fields to shred, user-facing API changes, or any engine-specific considerations like how to use shredded columns. -The approach builds upon the [Variant Binary Encoding](VariantEncoding.md), and leverages the existing Parquet specification. +Shredding enables the use of Parquet's columnar representation for more compact data encoding, column statistics for data skipping, and partial projections. -At a high level, we replace the `value` field of the Variant Parquet group with one or more fields called `object`, `array`, `typed_value`, and `variant_value`. -These represent a fixed schema suitable for constructing the full Variant value for each row. 
+For example, the query `SELECT variant_get(event, '$.event_ts', 'timestamp') FROM tbl` only needs to load field `event_ts`, and if that column is shredded, it can be read by columnar projection without reading or deserializing the rest of the `event` Variant. +Similarly, for the query `SELECT * FROM tbl WHERE variant_get(event, '$.event_type', 'string') = 'signup'`, the `event_type` shredded column metadata can be used for skipping and to lazily load the rest of the Variant. -Shredding allows a query engine to reap the full benefits of Parquet's columnar representation, such as more compact data encoding, min/max statistics for data skipping, and I/O and CPU savings from pruning unnecessary fields not accessed by a query (including the non-shredded Variant binary data). -Without shredding, any query that accesses a Variant column must fetch all bytes of the full binary buffer. -With shredding, we can get nearly equivalent performance as in a relational (scalar) data model. +## Variant Metadata -For example, `select variant_get(variant_col, ‘$.field1.inner_field2’, ‘string’) from tbl` only needs to access `inner_field2`, and the file scan could avoid fetching the rest of the Variant value if this field was shredded into a separate column in the Parquet schema. -Similarly, for the query `select * from tbl where variant_get(variant_col, ‘$.id’, ‘integer’) = 123`, the scan could first decode the shredded `id` column, and only fetch/decode the full Variant value for rows that pass the filter. +Variant metadata is stored in the top-level Variant group in a binary `metadata` column regardless of whether the Variant value is shredded. -# Parquet Example +All `value` columns within the Variant must use the same `metadata`. +All field names of a Variant, whether shredded or not, must be present in the metadata. -Consider the following Parquet schema together with how Variant values might be mapped to it. 
-Notice that we represent each shredded field in `object` as a group of two fields, `typed_value` and `variant_value`. -We extract all homogenous data items of a certain path into `typed_value`, and set aside incompatible data items in `variant_value`. -Intuitively, incompatibilities within the same path may occur because we store the shredding schema per Parquet file, and each file can contain several row groups. -Selecting a type for each field that is acceptable for all rows would be impractical because it would require buffering the contents of an entire file before writing. +## Value Shredding -Typically, the expectation is that `variant_value` exists at every level as an option, along with one of `object`, `array` or `typed_value`. -If the actual Variant value contains a type that does not match the provided schema, it is stored in `variant_value`. -An `variant_value` may also be populated if an object can be partially represented: any fields that are present in the schema must be written to those fields, and any missing fields are written to `variant_value`. - -The
Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]
rdblue commented on code in PR #461: URL: https://github.com/apache/parquet-format/pull/461#discussion_r1859117065 ## VariantShredding.md: ## @@ -25,290 +25,316 @@ The Variant type is designed to store and process semi-structured data efficiently, even with heterogeneous values. Query engines encode each Variant value in a self-describing format, and store it as a group containing `value` and `metadata` binary fields in Parquet. Since data is often partially homogenous, it can be beneficial to extract certain fields into separate Parquet columns to further improve performance. -We refer to this process as **shredding**. -Each Parquet file remains fully self-describing, with no additional metadata required to read or fully reconstruct the Variant data from the file. -Combining shredding with a binary residual provides the flexibility to represent complex, evolving data with an unbounded number of unique fields while limiting the size of file schemas, and retaining the performance benefits of a columnar format. +This process is **shredding**. -This document focuses on the shredding semantics, Parquet representation, implications for readers and writers, as well as the Variant reconstruction. -For now, it does not discuss which fields to shred, user-facing API changes, or any engine-specific considerations like how to use shredded columns. -The approach builds upon the [Variant Binary Encoding](VariantEncoding.md), and leverages the existing Parquet specification. +Shredding enables the use of Parquet's columnar representation for more compact data encoding, column statistics for data skipping, and partial projections. -At a high level, we replace the `value` field of the Variant Parquet group with one or more fields called `object`, `array`, `typed_value`, and `variant_value`. -These represent a fixed schema suitable for constructing the full Variant value for each row. 
+For example, the query `SELECT variant_get(event, '$.event_ts', 'timestamp') FROM tbl` only needs to load field `event_ts`, and if that column is shredded, it can be read by columnar projection without reading or deserializing the rest of the `event` Variant. +Similarly, for the query `SELECT * FROM tbl WHERE variant_get(event, '$.event_type', 'string') = 'signup'`, the `event_type` shredded column metadata can be used for skipping and to lazily load the rest of the Variant. -Shredding allows a query engine to reap the full benefits of Parquet's columnar representation, such as more compact data encoding, min/max statistics for data skipping, and I/O and CPU savings from pruning unnecessary fields not accessed by a query (including the non-shredded Variant binary data). -Without shredding, any query that accesses a Variant column must fetch all bytes of the full binary buffer. -With shredding, we can get nearly equivalent performance as in a relational (scalar) data model. +## Variant Metadata -For example, `select variant_get(variant_col, ‘$.field1.inner_field2’, ‘string’) from tbl` only needs to access `inner_field2`, and the file scan could avoid fetching the rest of the Variant value if this field was shredded into a separate column in the Parquet schema. -Similarly, for the query `select * from tbl where variant_get(variant_col, ‘$.id’, ‘integer’) = 123`, the scan could first decode the shredded `id` column, and only fetch/decode the full Variant value for rows that pass the filter. +Variant metadata is stored in the top-level Variant group in a binary `metadata` column regardless of whether the Variant value is shredded. -# Parquet Example +All `value` columns within the Variant must use the same `metadata`. +All field names of a Variant, whether shredded or not, must be present in the metadata. -Consider the following Parquet schema together with how Variant values might be mapped to it. 
-Notice that we represent each shredded field in `object` as a group of two fields, `typed_value` and `variant_value`. -We extract all homogenous data items of a certain path into `typed_value`, and set aside incompatible data items in `variant_value`. -Intuitively, incompatibilities within the same path may occur because we store the shredding schema per Parquet file, and each file can contain several row groups. -Selecting a type for each field that is acceptable for all rows would be impractical because it would require buffering the contents of an entire file before writing. +## Value Shredding -Typically, the expectation is that `variant_value` exists at every level as an option, along with one of `object`, `array` or `typed_value`. -If the actual Variant value contains a type that does not match the provided schema, it is stored in `variant_value`. -An `variant_value` may also be populated if an object can be partially represented: any fields that are present in the schema must be written to those fields, and any missing fields are written to `variant_value`. - -The
Re: [PR] GH-465: Clarify backward-compatibility rules on LIST type [parquet-format]
wgtmac commented on code in PR #466:
URL: https://github.com/apache/parquet-format/pull/466#discussion_r1859989177

## LogicalTypes.md: ##
@@ -684,44 +702,67 @@ optional group my_list (LIST) { } ```
-Some existing data does not include the inner element layer. For
-backward-compatibility, the type of elements in `LIST`-annotated structures
-should always be determined by the following rules:
+# 2-level structure
+
+Some existing data does not include the inner element layer, resulting in a
+`LIST` that annotates a 2-level structure. Unlike the 3-level structure, the
+repetition of a 2-level structure can be `optional`, `required`, or `repeated`.
+When it is `repeated`, the `LIST`-annotated 2-level structure can only serve as
+an element within another `LIST`-annotated 2-level structure.
+
+```
+ group (LIST) {
+ repeated ;
+}

Review Comment:
Removed
Re: [PR] GH-465: Clarify backward-compatibility rules on LIST type [parquet-format]
mapleFU commented on PR #466:
URL: https://github.com/apache/parquet-format/pull/466#issuecomment-2502968117

> The rules part is looking good, but I think that spending time documenting what people did incorrectly years ago makes the doc more confusing and increases chances that people will write invalid lists. I'd prefer to revert most of the changes that explain what people did incorrectly.

I agree. But I think those can be posted in the pull-request description.
Re: [PR] GH-465: Clarify backward-compatibility rules on LIST type [parquet-format]
wgtmac commented on PR #466:
URL: https://github.com/apache/parquet-format/pull/466#issuecomment-2502982189

@rdblue Thanks for your review! I have removed all unnecessary changes. Please take a look again.
Re: [PR] GH-465: Clarify backward-compatibility rules on LIST type [parquet-format]
wgtmac commented on code in PR #466:
URL: https://github.com/apache/parquet-format/pull/466#discussion_r1859998523

## LogicalTypes.md: ##
@@ -684,44 +689,58 @@ optional group my_list (LIST) { } ```
-Some existing data does not include the inner element layer. For
-backward-compatibility, the type of elements in `LIST`-annotated structures
+Some existing data does not include the inner element layer, resulting in a
+`LIST` that annotates a 2-level structure. Unlike the 3-level structure, the
+repetition of a 2-level structure can be `optional`, `required`, or `repeated`.
+When it is `repeated`, the `LIST`-annotated 2-level structure can only serve as
+an element within another `LIST`-annotated 2-level structure.
+
+For backward-compatibility, the type of elements in `LIST`-annotated structures
 should always be determined by the following rules:
 1. If the repeated field is not a group, then its type is the element type and elements are required.
 2. If the repeated field is a group with multiple fields, then its type is the element type and elements are required.
-3. If the repeated field is a group with one field and is named either `array`
+3. If the repeated field is a group with one field with `repeated` repetition,
+ then its type is the element type and elements are required.
+4. If the repeated field is a group with one field and is named either `array`
 or uses the `LIST`-annotated group's name with `_tuple` appended then the repeated type is the element type and elements are required.
-4. Otherwise, the repeated field's type is the element type with the repeated
+5. Otherwise, the repeated field's type is the element type with the repeated

Review Comment:
I don't want to add an example for rule 5 because it is already at line 685
Re: [PR] GH-465: Clarify backward-compatibility rules on LIST type [parquet-format]
wgtmac commented on code in PR #466:
URL: https://github.com/apache/parquet-format/pull/466#discussion_r1859997738

## LogicalTypes.md: ##
@@ -684,44 +689,58 @@ optional group my_list (LIST) { } ```
-Some existing data does not include the inner element layer. For
-backward-compatibility, the type of elements in `LIST`-annotated structures
+Some existing data does not include the inner element layer, resulting in a
+`LIST` that annotates a 2-level structure. Unlike the 3-level structure, the
+repetition of a 2-level structure can be `optional`, `required`, or `repeated`.
+When it is `repeated`, the `LIST`-annotated 2-level structure can only serve as
+an element within another `LIST`-annotated 2-level structure.
+
+For backward-compatibility, the type of elements in `LIST`-annotated structures
 should always be determined by the following rules:
 1. If the repeated field is not a group, then its type is the element type and elements are required.
 2. If the repeated field is a group with multiple fields, then its type is the element type and elements are required.
-3. If the repeated field is a group with one field and is named either `array`
+3. If the repeated field is a group with one field with `repeated` repetition,
+ then its type is the element type and elements are required.
+4. If the repeated field is a group with one field and is named either `array`
 or uses the `LIST`-annotated group's name with `_tuple` appended then the repeated type is the element type and elements are required.
-4. Otherwise, the repeated field's type is the element type with the repeated
+5. Otherwise, the repeated field's type is the element type with the repeated

Review Comment:
I have reverted most of the previous changes and now it should be clear.

@etseidl @mapleFU To resolve a LIST-annotated group, we should apply the rules in order:
- check if it is a 2-level structure (rules 1 to 3)
- check if it is a special 2-level structure (rule 4)
- otherwise, it is a 3-level structure (rule 5)
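As an illustration of the ordering described in the comment above, the following Python sketch applies the five quoted rules in sequence. It is a reading aid, not code from parquet-format or parquet-java. The `Field` class and `resolve_list_element` function are hypothetical names. The treatment of rule 5 (the element is the single child of the repeated group, keeping its own repetition) follows the "3-level structure" summary in the comment and is an assumption, since the rule text is cut off in the quoted diff. The two `ValueError` guards are likewise assumptions about inputs the quoted rules do not cover.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Field:
    """Hypothetical, minimal stand-in for a Parquet schema node."""
    name: str
    repetition: str              # "required", "optional", or "repeated"
    type: str = "group"          # primitive type name, or "group"
    children: List["Field"] = field(default_factory=list)

    @property
    def is_group(self) -> bool:
        return self.type == "group"


def resolve_list_element(list_group: Field) -> Tuple[Field, bool]:
    """Return (element_field, element_is_required) for a LIST-annotated group."""
    if len(list_group.children) != 1 or list_group.children[0].repetition != "repeated":
        # Assumption: the annotated group wraps exactly one repeated field.
        raise ValueError("expected exactly one repeated field inside the LIST group")
    repeated = list_group.children[0]

    # Rule 1: a repeated primitive is itself the (required) element.
    if not repeated.is_group:
        return repeated, True
    # Rule 2: a repeated group with multiple fields is the (required) element.
    if len(repeated.children) > 1:
        return repeated, True
    if not repeated.children:
        # Assumption: an empty repeated group is not covered by the quoted rules.
        raise ValueError("repeated group has no fields")
    only_child = repeated.children[0]
    # Rule 3: a single child that is itself repeated means 2-level structure.
    if only_child.repetition == "repeated":
        return repeated, True
    # Rule 4: legacy names mark a 2-level structure.
    if repeated.name in ("array", list_group.name + "_tuple"):
        return repeated, True
    # Rule 5: otherwise treat it as the standard 3-level structure.
    return only_child, only_child.repetition == "required"


# Standard 3-level list: optional group my_list (LIST) { repeated group list { optional binary element; } }
three_level = Field("my_list", "optional", children=[
    Field("list", "repeated", children=[Field("element", "optional", type="binary")]),
])
element, required = resolve_list_element(three_level)
print(element.name, required)
```

Because rules 1 to 4 all return the repeated field itself, the only branch that descends one more level is the final one, which is why the order of the checks matters.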
Re: [PR] GH-3070: Add Variant logical type annotation to parquet-java [parquet-java]
wgtmac commented on PR #3072:
URL: https://github.com/apache/parquet-java/pull/3072#issuecomment-2502503713

I think it should be in one change. parquet-format cannot be released without a concrete PoC implementation in parquet-java. Without that release, separate changes may break CI and thus cannot be merged.
Re: [PR] GH-465: Clarify backward-compatibility rules on LIST type [parquet-format]
wgtmac commented on code in PR #466:
URL: https://github.com/apache/parquet-format/pull/466#discussion_r1859970898

## LogicalTypes.md: ##
@@ -609,9 +609,20 @@ that is neither contained by a `LIST`- or `MAP`-annotated group nor annotated by `LIST` or `MAP` should be interpreted as a required list of required elements where the element type is the type of the field.
-Implementations should use either `LIST` and `MAP` annotations _or_ unannotated
-repeated fields, but not both. When using the annotations, no unannotated
-repeated types are allowed.
+```
+// List (non-null list, non-null elements)
+repeated int32 num;
+
+// List> (non-null list, non-null elements)
+repeated group my_list {
+ required int32 num;
+ optional binary str (STRING);
+}

Review Comment:
That makes sense. Let me remove these examples first. I think a follow-up is to deprecate it by moving it to the backward-compatibility section and adding strong wording to discourage writers from emitting it.
Re: [PR] MINOR: Add `doap.rdf` file for release tracking [parquet-java]
Fokko closed pull request #3001: MINOR: Add `doap.rdf` file for release tracking
URL: https://github.com/apache/parquet-java/pull/3001
Re: [PR] GH-3070: Add Variant logical type annotation to parquet-java [parquet-java]
Fokko commented on PR #3072:
URL: https://github.com/apache/parquet-java/pull/3072#issuecomment-2500124168

@aihuaxu I agree with @emkornfield that the `iceberg-java` implementation should be able to read and write the variant type. It would also be great to add some example Parquet files to https://github.com/apache/parquet-testing; this will also help adoption by other implementations. See https://github.com/apache/parquet-format/pull/456#issuecomment-2479905612