Re: [PR] GH-465: Clarify backward-compatibility rules on LIST type [parquet-format]
wgtmac commented on PR #466: URL: https://github.com/apache/parquet-format/pull/466#issuecomment-2519277990 Will merge this by the end of this week if there are no objections. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@parquet.apache.org For additional commands, e-mail: issues-h...@parquet.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] GH-465: Clarify backward-compatibility rules on LIST type [parquet-format]
pitrou commented on code in PR #466: URL: https://github.com/apache/parquet-format/pull/466#discussion_r1868965095 ## LogicalTypes.md: ## @@ -684,44 +703,61 @@ optional group my_list (LIST) { } ``` -Some existing data does not include the inner element layer. For -backward-compatibility, the type of elements in `LIST`-annotated structures +Some existing data does not include the inner element layer, resulting in a +`LIST` that annotates a 2-level structure. Unlike the 3-level structure, the +repetition of a 2-level structure can be `optional`, `required`, or `repeated`. +When it is `repeated`, the `LIST`-annotated 2-level structure can only serve as +an element within another `LIST`-annotated 2-level structure. + +For backward-compatibility, the type of elements in `LIST`-annotated structures should always be determined by the following rules: 1. If the repeated field is not a group, then its type is the element type and elements are required. 2. If the repeated field is a group with multiple fields, then its type is the element type and elements are required. -3. If the repeated field is a group with one field and is named either `array` +3. If the repeated field is a group with one field with `repeated` repetition, + then its type is the element type and elements are required. +4. If the repeated field is a group with one field and is named either `array` or uses the `LIST`-annotated group's name with `_tuple` appended then the repeated type is the element type and elements are required. -4. Otherwise, the repeated field's type is the element type with the repeated +5. Otherwise, the repeated field's type is the element type with the repeated field's repetition. Examples that can be interpreted using these rules: ``` -// List (nullable list, non-null elements) +WARNING: writers should not produce list types like these examples! They are +just for the purpose of reading existing data for backward-compatibility. 
+ +// Rule 1: List (nullable list, non-null elements) optional group my_list (LIST) { repeated int32 element; } -// List> (nullable list, non-null elements) +// Rule 2: List> (nullable list, non-null elements) optional group my_list (LIST) { repeated group element { required binary str (STRING); required int32 num; }; } -// List> (nullable list, non-null elements) +// Rule 3: List> (nullable outer list, non-null elements) +optional group my_list (LIST) { + repeated group array (LIST) { +repeated int32 array; + }; +} + +// Rule 4: List> (nullable list, non-null elements) optional group my_list (LIST) { repeated group array { required binary str (STRING); }; } -// List> (nullable list, non-null elements) +// Rule 4: List> (nullable list, non-null elements) optional group my_list (LIST) { repeated group my_list_tuple { required binary str (STRING); Review Comment: For the record: I would expect an example of Rule 5 below?
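For readers implementing the five backward-compatibility rules quoted in this diff, the dispatch order can be sketched as a classifier. This is a minimal illustration over a hypothetical field model (the `Field` class with `name`, `repetition`, `type`, and `fields` attributes is assumed for the sketch, not a Parquet API):

```python
from dataclasses import dataclass, field as dc_field
from typing import List

@dataclass
class Field:
    """Hypothetical schema-field model, for illustration only."""
    name: str
    repetition: str                    # "required" | "optional" | "repeated"
    type: str = "group"                # primitive type name, or "group"
    fields: List["Field"] = dc_field(default_factory=list)

def legacy_list_rule(repeated_field: Field, list_group_name: str) -> int:
    """Return which backward-compatibility rule (1-5) governs the element
    type of a legacy LIST whose repeated child is `repeated_field`."""
    if repeated_field.type != "group":            # Rule 1: primitive element
        return 1
    if len(repeated_field.fields) > 1:            # Rule 2: multi-field group
        return 2
    if repeated_field.fields[0].repetition == "repeated":  # Rule 3: nested 2-level
        return 3
    if repeated_field.name in ("array", list_group_name + "_tuple"):  # Rule 4
        return 4
    return 5                                      # Rule 5: fall-through
```

For example, a `repeated group array (LIST)` whose single child is itself `repeated` classifies under Rule 3 (checked before the name-based Rule 4), matching the Rule 3 example in the diff.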
Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]
emkornfield commented on code in PR #461: URL: https://github.com/apache/parquet-format/pull/461#discussion_r1868977523 ## VariantShredding.md: ## @@ -25,290 +25,316 @@ The Variant type is designed to store and process semi-structured data efficiently, even with heterogeneous values. Query engines encode each Variant value in a self-describing format, and store it as a group containing `value` and `metadata` binary fields in Parquet. Since data is often partially homogenous, it can be beneficial to extract certain fields into separate Parquet columns to further improve performance. -We refer to this process as **shredding**. -Each Parquet file remains fully self-describing, with no additional metadata required to read or fully reconstruct the Variant data from the file. -Combining shredding with a binary residual provides the flexibility to represent complex, evolving data with an unbounded number of unique fields while limiting the size of file schemas, and retaining the performance benefits of a columnar format. +This process is **shredding**. -This document focuses on the shredding semantics, Parquet representation, implications for readers and writers, as well as the Variant reconstruction. -For now, it does not discuss which fields to shred, user-facing API changes, or any engine-specific considerations like how to use shredded columns. -The approach builds upon the [Variant Binary Encoding](VariantEncoding.md), and leverages the existing Parquet specification. +Shredding enables the use of Parquet's columnar representation for more compact data encoding, column statistics for data skipping, and partial projections. -At a high level, we replace the `value` field of the Variant Parquet group with one or more fields called `object`, `array`, `typed_value`, and `variant_value`. -These represent a fixed schema suitable for constructing the full Variant value for each row. 
+For example, the query `SELECT variant_get(event, '$.event_ts', 'timestamp') FROM tbl` only needs to load field `event_ts`, and if that column is shredded, it can be read by columnar projection without reading or deserializing the rest of the `event` Variant. +Similarly, for the query `SELECT * FROM tbl WHERE variant_get(event, '$.event_type', 'string') = 'signup'`, the `event_type` shredded column metadata can be used for skipping and to lazily load the rest of the Variant. -Shredding allows a query engine to reap the full benefits of Parquet's columnar representation, such as more compact data encoding, min/max statistics for data skipping, and I/O and CPU savings from pruning unnecessary fields not accessed by a query (including the non-shredded Variant binary data). -Without shredding, any query that accesses a Variant column must fetch all bytes of the full binary buffer. -With shredding, we can get nearly equivalent performance as in a relational (scalar) data model. +## Variant Metadata -For example, `select variant_get(variant_col, ‘$.field1.inner_field2’, ‘string’) from tbl` only needs to access `inner_field2`, and the file scan could avoid fetching the rest of the Variant value if this field was shredded into a separate column in the Parquet schema. -Similarly, for the query `select * from tbl where variant_get(variant_col, ‘$.id’, ‘integer’) = 123`, the scan could first decode the shredded `id` column, and only fetch/decode the full Variant value for rows that pass the filter. +Variant metadata is stored in the top-level Variant group in a binary `metadata` column regardless of whether the Variant value is shredded. -# Parquet Example +All `value` columns within the Variant must use the same `metadata`. +All field names of a Variant, whether shredded or not, must be present in the metadata. -Consider the following Parquet schema together with how Variant values might be mapped to it. 
-Notice that we represent each shredded field in `object` as a group of two fields, `typed_value` and `variant_value`. -We extract all homogenous data items of a certain path into `typed_value`, and set aside incompatible data items in `variant_value`. -Intuitively, incompatibilities within the same path may occur because we store the shredding schema per Parquet file, and each file can contain several row groups. -Selecting a type for each field that is acceptable for all rows would be impractical because it would require buffering the contents of an entire file before writing. +## Value Shredding -Typically, the expectation is that `variant_value` exists at every level as an option, along with one of `object`, `array` or `typed_value`. -If the actual Variant value contains a type that does not match the provided schema, it is stored in `variant_value`. -An `variant_value` may also be populated if an object can be partially represented: any fields that are present in the schema must be written to those fields, and any missing fields are written to `variant_value`. -
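The `typed_value` / `variant_value` split described in this diff can be illustrated with a small sketch. This is not the spec's binary encoding — it uses plain Python dicts, and `shred_object` is a hypothetical name — but it shows the idea: fields covered by the shredding schema go to typed columns, while anything mismatching or unlisted stays in the residual:

```python
def shred_object(obj, schema):
    """Split an object into (typed, residual) per a shredding schema.

    `schema` maps field name -> expected Python type. Fields present in the
    schema whose values match the expected type are shredded into typed
    columns; every other field is kept in the residual, which stands in for
    the variant_value binary in this sketch.
    """
    typed, residual = {}, {}
    for name, value in obj.items():
        expected = schema.get(name)
        if expected is not None and isinstance(value, expected):
            typed[name] = value      # shredded: goes to a typed_value column
        else:
            residual[name] = value   # residual: stays in variant_value
    return typed, residual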
Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]
emkornfield commented on code in PR #461: URL: https://github.com/apache/parquet-format/pull/461#discussion_r1868866777 ## VariantShredding.md: ## @@ -25,290 +25,318 @@ The Variant type is designed to store and process semi-structured data efficiently, even with heterogeneous values. Query engines encode each Variant value in a self-describing format, and store it as a group containing `value` and `metadata` binary fields in Parquet. Since data is often partially homogenous, it can be beneficial to extract certain fields into separate Parquet columns to further improve performance. -We refer to this process as **shredding**. -Each Parquet file remains fully self-describing, with no additional metadata required to read or fully reconstruct the Variant data from the file. -Combining shredding with a binary residual provides the flexibility to represent complex, evolving data with an unbounded number of unique fields while limiting the size of file schemas, and retaining the performance benefits of a columnar format. +This process is **shredding**. -This document focuses on the shredding semantics, Parquet representation, implications for readers and writers, as well as the Variant reconstruction. -For now, it does not discuss which fields to shred, user-facing API changes, or any engine-specific considerations like how to use shredded columns. -The approach builds upon the [Variant Binary Encoding](VariantEncoding.md), and leverages the existing Parquet specification. +Shredding enables the use of Parquet's columnar representation for more compact data encoding, column statistics for data skipping, and partial projections. -At a high level, we replace the `value` field of the Variant Parquet group with one or more fields called `object`, `array`, `typed_value`, and `variant_value`. -These represent a fixed schema suitable for constructing the full Variant value for each row. 
+For example, the query `SELECT variant_get(event, '$.event_ts', 'timestamp') FROM tbl` only needs to load field `event_ts`, and if that column is shredded, it can be read by columnar projection without reading or deserializing the rest of the `event` Variant. +Similarly, for the query `SELECT * FROM tbl WHERE variant_get(event, '$.event_type', 'string') = 'signup'`, the `event_type` shredded column metadata can be used for skipping and to lazily load the rest of the Variant. -Shredding allows a query engine to reap the full benefits of Parquet's columnar representation, such as more compact data encoding, min/max statistics for data skipping, and I/O and CPU savings from pruning unnecessary fields not accessed by a query (including the non-shredded Variant binary data). -Without shredding, any query that accesses a Variant column must fetch all bytes of the full binary buffer. -With shredding, we can get nearly equivalent performance as in a relational (scalar) data model. +## Variant Metadata -For example, `select variant_get(variant_col, ‘$.field1.inner_field2’, ‘string’) from tbl` only needs to access `inner_field2`, and the file scan could avoid fetching the rest of the Variant value if this field was shredded into a separate column in the Parquet schema. -Similarly, for the query `select * from tbl where variant_get(variant_col, ‘$.id’, ‘integer’) = 123`, the scan could first decode the shredded `id` column, and only fetch/decode the full Variant value for rows that pass the filter. +Variant metadata is stored in the top-level Variant group in a binary `metadata` column regardless of whether the Variant value is shredded. -# Parquet Example +All `value` columns within the Variant must use the same `metadata`. +All field names of a Variant, whether shredded or not, must be present in the metadata. -Consider the following Parquet schema together with how Variant values might be mapped to it. 
-Notice that we represent each shredded field in `object` as a group of two fields, `typed_value` and `variant_value`. -We extract all homogenous data items of a certain path into `typed_value`, and set aside incompatible data items in `variant_value`. -Intuitively, incompatibilities within the same path may occur because we store the shredding schema per Parquet file, and each file can contain several row groups. -Selecting a type for each field that is acceptable for all rows would be impractical because it would require buffering the contents of an entire file before writing. +## Value Shredding -Typically, the expectation is that `variant_value` exists at every level as an option, along with one of `object`, `array` or `typed_value`. -If the actual Variant value contains a type that does not match the provided schema, it is stored in `variant_value`. -An `variant_value` may also be populated if an object can be partially represented: any fields that are present in the schema must be written to those fields, and any missing fields are written to `variant_value`. -
Re: [I] `ParquetMetadata` JSON serialization is failing [parquet-java]
Fokko closed issue #3086: `ParquetMetadata` JSON serialization is failing URL: https://github.com/apache/parquet-java/issues/3086
Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]
emkornfield commented on code in PR #461: URL: https://github.com/apache/parquet-format/pull/461#discussion_r1869037182 ## VariantShredding.md: ## @@ -25,290 +25,316 @@ The Variant type is designed to store and process semi-structured data efficiently, even with heterogeneous values. Query engines encode each Variant value in a self-describing format, and store it as a group containing `value` and `metadata` binary fields in Parquet. Since data is often partially homogenous, it can be beneficial to extract certain fields into separate Parquet columns to further improve performance. -We refer to this process as **shredding**. -Each Parquet file remains fully self-describing, with no additional metadata required to read or fully reconstruct the Variant data from the file. -Combining shredding with a binary residual provides the flexibility to represent complex, evolving data with an unbounded number of unique fields while limiting the size of file schemas, and retaining the performance benefits of a columnar format. +This process is **shredding**. -This document focuses on the shredding semantics, Parquet representation, implications for readers and writers, as well as the Variant reconstruction. -For now, it does not discuss which fields to shred, user-facing API changes, or any engine-specific considerations like how to use shredded columns. -The approach builds upon the [Variant Binary Encoding](VariantEncoding.md), and leverages the existing Parquet specification. +Shredding enables the use of Parquet's columnar representation for more compact data encoding, column statistics for data skipping, and partial projections. -At a high level, we replace the `value` field of the Variant Parquet group with one or more fields called `object`, `array`, `typed_value`, and `variant_value`. -These represent a fixed schema suitable for constructing the full Variant value for each row. 
+For example, the query `SELECT variant_get(event, '$.event_ts', 'timestamp') FROM tbl` only needs to load field `event_ts`, and if that column is shredded, it can be read by columnar projection without reading or deserializing the rest of the `event` Variant. +Similarly, for the query `SELECT * FROM tbl WHERE variant_get(event, '$.event_type', 'string') = 'signup'`, the `event_type` shredded column metadata can be used for skipping and to lazily load the rest of the Variant. -Shredding allows a query engine to reap the full benefits of Parquet's columnar representation, such as more compact data encoding, min/max statistics for data skipping, and I/O and CPU savings from pruning unnecessary fields not accessed by a query (including the non-shredded Variant binary data). -Without shredding, any query that accesses a Variant column must fetch all bytes of the full binary buffer. -With shredding, we can get nearly equivalent performance as in a relational (scalar) data model. +## Variant Metadata -For example, `select variant_get(variant_col, ‘$.field1.inner_field2’, ‘string’) from tbl` only needs to access `inner_field2`, and the file scan could avoid fetching the rest of the Variant value if this field was shredded into a separate column in the Parquet schema. -Similarly, for the query `select * from tbl where variant_get(variant_col, ‘$.id’, ‘integer’) = 123`, the scan could first decode the shredded `id` column, and only fetch/decode the full Variant value for rows that pass the filter. +Variant metadata is stored in the top-level Variant group in a binary `metadata` column regardless of whether the Variant value is shredded. -# Parquet Example +All `value` columns within the Variant must use the same `metadata`. +All field names of a Variant, whether shredded or not, must be present in the metadata. -Consider the following Parquet schema together with how Variant values might be mapped to it. 
-Notice that we represent each shredded field in `object` as a group of two fields, `typed_value` and `variant_value`. -We extract all homogenous data items of a certain path into `typed_value`, and set aside incompatible data items in `variant_value`. -Intuitively, incompatibilities within the same path may occur because we store the shredding schema per Parquet file, and each file can contain several row groups. -Selecting a type for each field that is acceptable for all rows would be impractical because it would require buffering the contents of an entire file before writing. +## Value Shredding -Typically, the expectation is that `variant_value` exists at every level as an option, along with one of `object`, `array` or `typed_value`. -If the actual Variant value contains a type that does not match the provided schema, it is stored in `variant_value`. -An `variant_value` may also be populated if an object can be partially represented: any fields that are present in the schema must be written to those fields, and any missing fields are written to `variant_value`. -
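Reconstruction goes the other way: the full Variant value for a row is rebuilt by merging the shredded typed columns back with the residual. A minimal sketch under assumed dict-based semantics (hypothetical names, not the spec's API or binary format):

```python
def reconstruct_object(typed, residual):
    """Rebuild a full object from shredded typed columns plus the residual.

    `typed` holds shredded field values (None meaning the field was absent
    from this row); `residual` stands in for the variant_value binary.
    Illustrative only.
    """
    out = dict(residual or {})
    for name, value in typed.items():
        if value is not None:        # absent shredded fields contribute nothing
            out[name] = value
    return out
```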
Re: [PR] GH-465: Clarify backward-compatibility rules on LIST type [parquet-format]
pitrou commented on code in PR #466: URL: https://github.com/apache/parquet-format/pull/466#discussion_r1869082194 ## LogicalTypes.md: ## @@ -684,44 +689,58 @@ optional group my_list (LIST) { } ``` -Some existing data does not include the inner element layer. For -backward-compatibility, the type of elements in `LIST`-annotated structures +Some existing data does not include the inner element layer, resulting in a +`LIST` that annotates a 2-level structure. Unlike the 3-level structure, the +repetition of a 2-level structure can be `optional`, `required`, or `repeated`. +When it is `repeated`, the `LIST`-annotated 2-level structure can only serve as +an element within another `LIST`-annotated 2-level structure. + +For backward-compatibility, the type of elements in `LIST`-annotated structures should always be determined by the following rules: 1. If the repeated field is not a group, then its type is the element type and elements are required. 2. If the repeated field is a group with multiple fields, then its type is the element type and elements are required. -3. If the repeated field is a group with one field and is named either `array` +3. If the repeated field is a group with one field with `repeated` repetition, + then its type is the element type and elements are required. +4. If the repeated field is a group with one field and is named either `array` or uses the `LIST`-annotated group's name with `_tuple` appended then the repeated type is the element type and elements are required. -4. Otherwise, the repeated field's type is the element type with the repeated +5. Otherwise, the repeated field's type is the element type with the repeated Review Comment: Redundant with what? We've been explicitly listing examples for all other rules.
Re: [PR] GH-465: Clarify backward-compatibility rules on LIST type [parquet-format]
wgtmac commented on code in PR #466: URL: https://github.com/apache/parquet-format/pull/466#discussion_r1869110188 ## LogicalTypes.md: ## @@ -684,44 +689,58 @@ optional group my_list (LIST) { } ``` -Some existing data does not include the inner element layer. For -backward-compatibility, the type of elements in `LIST`-annotated structures +Some existing data does not include the inner element layer, resulting in a +`LIST` that annotates a 2-level structure. Unlike the 3-level structure, the +repetition of a 2-level structure can be `optional`, `required`, or `repeated`. +When it is `repeated`, the `LIST`-annotated 2-level structure can only serve as +an element within another `LIST`-annotated 2-level structure. + +For backward-compatibility, the type of elements in `LIST`-annotated structures should always be determined by the following rules: 1. If the repeated field is not a group, then its type is the element type and elements are required. 2. If the repeated field is a group with multiple fields, then its type is the element type and elements are required. -3. If the repeated field is a group with one field and is named either `array` +3. If the repeated field is a group with one field with `repeated` repetition, + then its type is the element type and elements are required. +4. If the repeated field is a group with one field and is named either `array` or uses the `LIST`-annotated group's name with `_tuple` appended then the repeated type is the element type and elements are required. -4. Otherwise, the repeated field's type is the element type with the repeated +5. Otherwise, the repeated field's type is the element type with the repeated Review Comment: Ok, I added one with a slightly different schema. The inner field is `optional` in the new example. Please take a look.
Re: [PR] GH-472: Add shredding version [parquet-format]
emkornfield commented on PR #474: URL: https://github.com/apache/parquet-format/pull/474#issuecomment-2516303681 CC @rdblue @gene-db
Re: [PR] GH-463: Add more types - time, nano timestamps, UUID to Variant spec [parquet-format]
emkornfield commented on PR #464: URL: https://github.com/apache/parquet-format/pull/464#issuecomment-2516307414 This LGTM. @RussellSpitzer, any more comments? Also, CC @gene-db @rdblue in case there are any concerns.
Re: [PR] GH-465: Clarify backward-compatibility rules on LIST type [parquet-format]
pitrou commented on code in PR #466: URL: https://github.com/apache/parquet-format/pull/466#discussion_r1869146424 ## LogicalTypes.md: ## @@ -684,44 +689,58 @@ optional group my_list (LIST) { } ``` -Some existing data does not include the inner element layer. For -backward-compatibility, the type of elements in `LIST`-annotated structures +Some existing data does not include the inner element layer, resulting in a +`LIST` that annotates a 2-level structure. Unlike the 3-level structure, the +repetition of a 2-level structure can be `optional`, `required`, or `repeated`. +When it is `repeated`, the `LIST`-annotated 2-level structure can only serve as +an element within another `LIST`-annotated 2-level structure. + +For backward-compatibility, the type of elements in `LIST`-annotated structures should always be determined by the following rules: 1. If the repeated field is not a group, then its type is the element type and elements are required. 2. If the repeated field is a group with multiple fields, then its type is the element type and elements are required. -3. If the repeated field is a group with one field and is named either `array` +3. If the repeated field is a group with one field with `repeated` repetition, + then its type is the element type and elements are required. +4. If the repeated field is a group with one field and is named either `array` or uses the `LIST`-annotated group's name with `_tuple` appended then the repeated type is the element type and elements are required. -4. Otherwise, the repeated field's type is the element type with the repeated +5. Otherwise, the repeated field's type is the element type with the repeated Review Comment: Ok, thanks.
To unsubscribe, e-mail: issues-unsubscr...@parquet.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@parquet.apache.org For additional commands, e-mail: issues-h...@parquet.apache.org
Re: [PR] [ignore] HADOOP-19087. Release Hadoop 3.4.1: test branch [parquet-java]
steveloughran closed pull request #2996: [ignore] HADOOP-19087. Release Hadoop 3.4.1: test branch URL: https://github.com/apache/parquet-java/pull/2996
Re: [PR] [ignore] HADOOP-19087. Release Hadoop 3.4.1: test branch [parquet-java]
steveloughran commented on PR #2996: URL: https://github.com/apache/parquet-java/pull/2996#issuecomment-2516976339 closing; all good now
Re: [PR] GH-465: Clarify backward-compatibility rules on LIST type [parquet-format]
wgtmac commented on code in PR #466: URL: https://github.com/apache/parquet-format/pull/466#discussion_r1869077618 ## LogicalTypes.md: ## @@ -684,44 +689,58 @@ optional group my_list (LIST) { } ``` -Some existing data does not include the inner element layer. For -backward-compatibility, the type of elements in `LIST`-annotated structures +Some existing data does not include the inner element layer, resulting in a +`LIST` that annotates a 2-level structure. Unlike the 3-level structure, the +repetition of a 2-level structure can be `optional`, `required`, or `repeated`. +When it is `repeated`, the `LIST`-annotated 2-level structure can only serve as +an element within another `LIST`-annotated 2-level structure. + +For backward-compatibility, the type of elements in `LIST`-annotated structures should always be determined by the following rules: 1. If the repeated field is not a group, then its type is the element type and elements are required. 2. If the repeated field is a group with multiple fields, then its type is the element type and elements are required. -3. If the repeated field is a group with one field and is named either `array` +3. If the repeated field is a group with one field with `repeated` repetition, + then its type is the element type and elements are required. +4. If the repeated field is a group with one field and is named either `array` or uses the `LIST`-annotated group's name with `_tuple` appended then the repeated type is the element type and elements are required. -4. Otherwise, the repeated field's type is the element type with the repeated +5. Otherwise, the repeated field's type is the element type with the repeated Review Comment: @pitrou Regarding your [question](https://github.com/apache/parquet-format/pull/466#discussion_r1868965095) on rule 5, I think it is redundant to add another one.
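For illustration only, the five backward-compatibility rules discussed in this review can be sketched as a small resolver over a toy schema model. The `Field` class and helper names below are hypothetical (not part of any Parquet library); they just make the rule ordering concrete:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Field:
    """Toy model of a Parquet schema node (illustrative only)."""
    name: str
    repetition: str                    # "required", "optional", or "repeated"
    is_group: bool = False
    children: List["Field"] = field(default_factory=list)

def list_element(list_group: Field, repeated: Field) -> Tuple[Field, str]:
    """Return (element_field, element_repetition) for the repeated child
    of a LIST-annotated group, following the 5 backward-compat rules."""
    # Rule 1: the repeated field is not a group -> it is the element, required.
    if not repeated.is_group:
        return repeated, "required"
    # Rule 2: a group with multiple fields -> the group is the element, required.
    if len(repeated.children) > 1:
        return repeated, "required"
    only_child = repeated.children[0]
    # Rule 3: a group with one repeated child -> the group is the element, required.
    if only_child.repetition == "repeated":
        return repeated, "required"
    # Rule 4: a group named "array" or "<list name>_tuple" -> element, required.
    if repeated.name in ("array", list_group.name + "_tuple"):
        return repeated, "required"
    # Rule 5: otherwise the single child is the element, with its own repetition.
    return only_child, only_child.repetition

# 2-level list: optional group my_list (LIST) { repeated int32 element; }
lst = Field("my_list", "optional", is_group=True)
rep = Field("element", "repeated")      # not a group -> rule 1
elem, repetition = list_element(lst, rep)
print(elem.name, repetition)            # -> element required
```

Note how the rules are order-dependent: a single-field group only falls through to rule 5 (keeping its child's repetition) after the `repeated`-child and legacy-name checks have been ruled out.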
Re: [PR] MINOR: Bump version to 1.16.0-SNAPSHOT [parquet-java]
wgtmac commented on PR #3097: URL: https://github.com/apache/parquet-java/pull/3097#issuecomment-2517510829 We can bump it to 1.16.0-SNAPSHOT for now. A major version bump is something serious to discuss.
Re: [PR] MINOR: Bump version to 1.16.0-SNAPSHOT [parquet-java]
wgtmac merged PR #3097: URL: https://github.com/apache/parquet-java/pull/3097
Re: [PR] GH-3078: Use Hadoop FileSystem.openFile() to open files [parquet-java]
steveloughran commented on PR #3079: URL: https://github.com/apache/parquet-java/pull/3079#issuecomment-2517815867 Shaves a HEAD request! For s3a it tells things to seek properly rather than having to guess afterwards. FWIW there's a "whole-file" read policy; we use this in Hadoop itself for stuff like distcp. Now I need to get the sequential policy into Avro so it knows that prefetching is good, rather than a waste of IO capacity.
[PR] MINOR: Clarify offsets etc are unsigned integers [parquet-format]
emkornfield opened a new pull request, #475: URL: https://github.com/apache/parquet-format/pull/475 ### Rationale for this change We should clarify whether metadata integers are signed or unsigned. ### What changes are included in this PR? Clarify signedness for Variant types. ### Do these changes have PoC implementations? No, this is still WIP.
Re: [PR] MINOR: Clarify offsets etc are unsigned integers [parquet-format]
emkornfield commented on PR #475: URL: https://github.com/apache/parquet-format/pull/475#issuecomment-2518262268 @gene-db is unsigned correct or should these be signed? CC @rdblue
Re: [PR] MINOR: Clarify offsets etc are unsigned integers [parquet-format]
gene-db commented on code in PR #475: URL: https://github.com/apache/parquet-format/pull/475#discussion_r1870179241 ## VariantEncoding.md: ## @@ -88,9 +88,9 @@ metadata |header | +---+ ``` -The metadata is encoded first with the `header` byte, then `dictionary_size` which is a little-endian value of `offset_size` bytes, and represents the number of string values in the dictionary. +The metadata is encoded first with the `header` byte, then `dictionary_size` which is a unsigned little-endian value of `offset_size` bytes, and represents the number of string values in the dictionary. Next, is an `offset` list, which contains `dictionary_size + 1` values. -Each `offset` is a little-endian value of `offset_size` bytes, and represents the starting byte offset of the i-th string in `bytes`. +Each `offset` is a usigned little-endian value of `offset_size` bytes, and represents the starting byte offset of the i-th string in `bytes`. Review Comment: ```suggestion Each `offset` is an unsigned little-endian value of `offset_size` bytes, and represents the starting byte offset of the i-th string in `bytes`. ``` ## VariantEncoding.md: ## @@ -69,17 +69,17 @@ The entire metadata is encoded as the following diagram shows: metadata |header | +---+ | | - :dictionary_size: <-- little-endian, `offset_size` bytes + :dictionary_size: <-- unsigned little-endian, `offset_size` bytes | | +---+ | | - :offset : <-- little-endian, `offset_size` bytes + :offset : <-- unsigned little-endian, `offset_size` bytes Review Comment: NIT: ```suggestion :offset : <-- unsigned little-endian, `offset_size` bytes ``` ## VariantEncoding.md: ## @@ -88,9 +88,9 @@ metadata |header | +---+ ``` -The metadata is encoded first with the `header` byte, then `dictionary_size` which is a little-endian value of `offset_size` bytes, and represents the number of string values in the dictionary. 
+The metadata is encoded first with the `header` byte, then `dictionary_size` which is a unsigned little-endian value of `offset_size` bytes, and represents the number of string values in the dictionary. Review Comment: ```suggestion The metadata is encoded first with the `header` byte, then `dictionary_size` which is an unsigned little-endian value of `offset_size` bytes, and represents the number of string values in the dictionary. ``` ## VariantEncoding.md: ## @@ -69,17 +69,17 @@ The entire metadata is encoded as the following diagram shows: metadata |header | +---+ | | - :dictionary_size: <-- little-endian, `offset_size` bytes + :dictionary_size: <-- unsigned little-endian, `offset_size` bytes | | +---+ | | - :offset : <-- little-endian, `offset_size` bytes + :offset : <-- unsigned little-endian, `offset_size` bytes | | +---+ : +---+ | | - :offset : <-- little-endian, `offset_size` bytes + :offset : <-- unsigned little-endian, `offset_size` bytes Review Comment: NIT: ```suggestion :offset : <-- unsigned little-endian, `offset_size` bytes ``` ## VariantEncoding.md: ## @@ -313,10 +313,10 @@ array value_data | | | | +---+ ``` -An array `value_data` begins with `num_elements`, a 1-byte or 4-byte little-endian value, representing the number of elements in the array. +An array `value_data` begins with `num_elements`, a 1-byte or 4-byte unsigned little-endian value, representing the number of elements in the array. The size in bytes of `num_elements` is indicated by `is_large` in the `value_header`. Next, is a `field_offset` list. -There are `num_elements + 1` number of entries and each `field_offset` is a little-endian value of `field_offset_size` bytes. +There are `num_elements + 1` number of entries and each `field_offset` is a unsigned little-endian value of `field_offset_size` bytes. Review Comment: ```suggestion There are `num_elements + 1` number of entries and each `field_offset` is an unsigned little-endian v
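As a sketch of the encoding being clarified in this review: `dictionary_size` and each `offset` are unsigned little-endian integers of `offset_size` bytes, which maps directly onto Python's `int.to_bytes(offset_size, "little")` (which is itself unsigned and rejects negative values). The header bit layout below (version in the low 4 bits, `offset_size - 1` in the top 2 bits) is an assumption taken from the VariantEncoding draft, and this helper is illustrative, not a real Variant implementation:

```python
def encode_variant_metadata(strings, offset_size=1, version=1):
    """Encode a Variant metadata dictionary (illustrative sketch).

    Layout: header byte, then dictionary_size, then len(strings) + 1
    offsets, then the concatenated UTF-8 bytes -- every integer an
    unsigned little-endian value of offset_size bytes.
    """
    encoded = [s.encode("utf-8") for s in strings]
    # Assumed header layout: version in bits 0-3, offset_size - 1 in bits 6-7.
    header = (version & 0x0F) | ((offset_size - 1) << 6)
    out = bytes([header])
    out += len(strings).to_bytes(offset_size, "little")   # dictionary_size
    # dictionary_size + 1 offsets: offset i is the start of the i-th string.
    offsets, pos = [0], 0
    for b in encoded:
        pos += len(b)
        offsets.append(pos)
    for off in offsets:
        out += off.to_bytes(offset_size, "little")        # unsigned LE
    return out + b"".join(encoded)

buf = encode_variant_metadata(["a", "bc"])
print(buf.hex())  # -> 0102000103616263
```

Using `int.to_bytes` keeps the "unsigned" requirement honest: attempting to write a negative offset raises `OverflowError` instead of silently producing a two's-complement value.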
Re: [PR] MINOR: Clarify offsets etc are unsigned integers [parquet-format]
emkornfield commented on PR #475: URL: https://github.com/apache/parquet-format/pull/475#issuecomment-2518422458 Thanks for the quick review, @gene-db. I'll merge this at the end of the week unless there are more comments. @aihuaxu
Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]
emkornfield commented on code in PR #461: URL: https://github.com/apache/parquet-format/pull/461#discussion_r1868938364 ## VariantShredding.md: ## @@ -25,290 +25,318 @@ The Variant type is designed to store and process semi-structured data efficiently, even with heterogeneous values. Query engines encode each Variant value in a self-describing format, and store it as a group containing `value` and `metadata` binary fields in Parquet. Since data is often partially homogenous, it can be beneficial to extract certain fields into separate Parquet columns to further improve performance. -We refer to this process as **shredding**. -Each Parquet file remains fully self-describing, with no additional metadata required to read or fully reconstruct the Variant data from the file. -Combining shredding with a binary residual provides the flexibility to represent complex, evolving data with an unbounded number of unique fields while limiting the size of file schemas, and retaining the performance benefits of a columnar format. +This process is **shredding**. -This document focuses on the shredding semantics, Parquet representation, implications for readers and writers, as well as the Variant reconstruction. -For now, it does not discuss which fields to shred, user-facing API changes, or any engine-specific considerations like how to use shredded columns. -The approach builds upon the [Variant Binary Encoding](VariantEncoding.md), and leverages the existing Parquet specification. +Shredding enables the use of Parquet's columnar representation for more compact data encoding, column statistics for data skipping, and partial projections. -At a high level, we replace the `value` field of the Variant Parquet group with one or more fields called `object`, `array`, `typed_value`, and `variant_value`. -These represent a fixed schema suitable for constructing the full Variant value for each row. 
+For example, the query `SELECT variant_get(event, '$.event_ts', 'timestamp') FROM tbl` only needs to load field `event_ts`, and if that column is shredded, it can be read by columnar projection without reading or deserializing the rest of the `event` Variant. +Similarly, for the query `SELECT * FROM tbl WHERE variant_get(event, '$.event_type', 'string') = 'signup'`, the `event_type` shredded column metadata can be used for skipping and to lazily load the rest of the Variant. -Shredding allows a query engine to reap the full benefits of Parquet's columnar representation, such as more compact data encoding, min/max statistics for data skipping, and I/O and CPU savings from pruning unnecessary fields not accessed by a query (including the non-shredded Variant binary data). -Without shredding, any query that accesses a Variant column must fetch all bytes of the full binary buffer. -With shredding, we can get nearly equivalent performance as in a relational (scalar) data model. +## Variant Metadata -For example, `select variant_get(variant_col, ‘$.field1.inner_field2’, ‘string’) from tbl` only needs to access `inner_field2`, and the file scan could avoid fetching the rest of the Variant value if this field was shredded into a separate column in the Parquet schema. -Similarly, for the query `select * from tbl where variant_get(variant_col, ‘$.id’, ‘integer’) = 123`, the scan could first decode the shredded `id` column, and only fetch/decode the full Variant value for rows that pass the filter. +Variant metadata is stored in the top-level Variant group in a binary `metadata` column regardless of whether the Variant value is shredded. -# Parquet Example +All `value` columns within the Variant must use the same `metadata`. +All field names of a Variant, whether shredded or not, must be present in the metadata. -Consider the following Parquet schema together with how Variant values might be mapped to it. 
-Notice that we represent each shredded field in `object` as a group of two fields, `typed_value` and `variant_value`. -We extract all homogenous data items of a certain path into `typed_value`, and set aside incompatible data items in `variant_value`. -Intuitively, incompatibilities within the same path may occur because we store the shredding schema per Parquet file, and each file can contain several row groups. -Selecting a type for each field that is acceptable for all rows would be impractical because it would require buffering the contents of an entire file before writing. +## Value Shredding -Typically, the expectation is that `variant_value` exists at every level as an option, along with one of `object`, `array` or `typed_value`. -If the actual Variant value contains a type that does not match the provided schema, it is stored in `variant_value`. -An `variant_value` may also be populated if an object can be partially represented: any fields that are present in the schema must be written to those fields, and any missing fields are written to `variant_value`. -
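As a rough illustration of the `typed_value` / `variant_value` split described in the diff above: values that match the shredded type go to the typed column, and incompatible values are set aside in the residual, so the original column can always be reconstructed. This is a deliberate simplification (real shredding writes Parquet columns and binary Variant values, not Python lists):

```python
def shred(values, typed_check):
    """Split heterogeneous values into a typed column and a residual.

    Values matching the shredded type go to typed_value; everything
    else is set aside in variant_value. Exactly one side is non-None
    per row.
    """
    typed_value, variant_value = [], []
    for v in values:
        if typed_check(v):
            typed_value.append(v)
            variant_value.append(None)
        else:
            typed_value.append(None)
            variant_value.append(v)
    return typed_value, variant_value

def reconstruct(typed_value, variant_value):
    """Rebuild the original column from the two shredded sides."""
    return [t if t is not None else v
            for t, v in zip(typed_value, variant_value)]

rows = [34, "signup", 56]                    # heterogeneous field values
tv, vv = shred(rows, lambda v: isinstance(v, int))
print(tv)  # -> [34, None, 56]
print(vv)  # -> [None, 'signup', None]
```

This also shows why incompatibilities are expected in practice: the shredding schema is fixed per file, so any row whose value does not match it must land in the residual rather than forcing a schema change mid-file.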
Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]
emkornfield commented on code in PR #461: URL: https://github.com/apache/parquet-format/pull/461#discussion_r1868952437 ## VariantEncoding.md: ## @@ -416,14 +444,36 @@ Field names are case-sensitive. Field names are required to be unique for each object. It is an error for an object to contain two fields with the same name, whether or not they have distinct dictionary IDs. -# Versions and extensions +## Versions and extensions An implementation is not expected to parse a Variant value whose metadata version is higher than the version supported by the implementation. However, new types may be added to the specification without incrementing the version ID. In such a situation, an implementation should be able to read the rest of the Variant value if desired. -# Shredding +## Shredding A single Variant object may have poor read performance when only a small subset of fields are needed. A better approach is to create separate columns for individual fields, referred to as shredding or subcolumnarization. [VariantShredding.md](VariantShredding.md) describes the Variant shredding specification in Parquet. + +## Conversion to JSON + +Values stored in the Variant encoding are a superset of JSON values. +For example, a Variant value can be a date that has no equivalent type in JSON. +To maximize compatibility with readers that can process JSON but not Variant, the following conversions should be used when producing JSON from a Variant: + +| Variant type | JSON type | Representation requirements | Example | +|---|---|--|--| +| Null type | null | `null` | `null` | +| Boolean | boolean | `true` or `false` | `true` | +| Exact Numeric | number| Digits in fraction must match scale, no exponent | `34`, 34.00 | Review Comment: > Why would we require an engine to produce a normalized value? At least for me, I don't think it is about "requiring" an engine to produce a normalized value first.
I think if an engine is reading variant and converting it to JSON, it is possibly doing so through an internal representation so it can still apply operators on top of the JSON value and possibly even store it as an internal representation. Conversion to a string is really only an end-user visible thing. So when I read this it seems to be requiring an engine to NOT normalize, which could be hard to implement for some engines.
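The "digits in fraction must match scale, no exponent" requirement quoted in the conversion table above can be sketched with Python's `decimal` module. The helper name is hypothetical; it assumes the exact numeric is carried as an unscaled integer plus a scale, as decimal types commonly are:

```python
from decimal import Decimal

def decimal_to_json_number(unscaled: int, scale: int) -> str:
    """Render an exact numeric (unscaled integer + scale) as a JSON
    number with exactly `scale` fractional digits and no exponent."""
    # scaleb(-scale) keeps the coefficient's digits, so the string form
    # carries exactly `scale` digits after the point; the "f" format
    # forces plain (non-exponent) notation even for tiny values.
    return format(Decimal(unscaled).scaleb(-scale), "f")

print(decimal_to_json_number(3400, 2))  # -> 34.00
print(decimal_to_json_number(34, 0))    # -> 34
```

Formatting through `Decimal` rather than `float` is the key design choice here: it preserves trailing zeros (`34.00` stays `34.00`) and never falls back to `E` notation, both of which the quoted table rules out.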