Re: [PR] GH-465: Clarify backward-compatibility rules on LIST type [parquet-format]

2024-11-26 Thread via GitHub


rdblue commented on code in PR #466:
URL: https://github.com/apache/parquet-format/pull/466#discussion_r1859233788


##
LogicalTypes.md:
##
@@ -609,9 +609,20 @@ that is neither contained by a `LIST`- or `MAP`-annotated 
group nor annotated
 by `LIST` or `MAP` should be interpreted as a required list of required
 elements where the element type is the type of the field.
 
-Implementations should use either `LIST` and `MAP` annotations _or_ unannotated
-repeated fields, but not both. When using the annotations, no unannotated
-repeated types are allowed.
+```
+// List<Integer> (non-null list, non-null elements)
+repeated int32 num;
+
+// List<Tuple<Integer, String>> (non-null list, non-null elements)
+repeated group my_list {
+  required int32 num;
+  optional binary str (STRING);
+}
+```
+
+For all fields in the schema, implementations should use either `LIST` and

Review Comment:
   -0 on this change. I don't think this is clearer, and I would prefer to 
avoid the churn.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@parquet.apache.org
For additional commands, e-mail: issues-h...@parquet.apache.org



Re: [PR] GH-465: Clarify backward-compatibility rules on LIST type [parquet-format]

2024-11-26 Thread via GitHub


rdblue commented on code in PR #466:
URL: https://github.com/apache/parquet-format/pull/466#discussion_r1859240321


##
LogicalTypes.md:
##
@@ -684,44 +702,67 @@ optional group my_list (LIST) {
 }
 ```
 
-Some existing data does not include the inner element layer. For
-backward-compatibility, the type of elements in `LIST`-annotated structures
-should always be determined by the following rules:
+# 2-level structure
+
+Some existing data does not include the inner element layer, resulting in a
+`LIST` that annotates a 2-level structure. Unlike the 3-level structure, the
+repetition of a 2-level structure can be `optional`, `required`, or `repeated`.
+When it is `repeated`, the `LIST`-annotated 2-level structure can only serve as
+an element within another `LIST`-annotated 2-level structure.
+
+```
+<list-repetition> group <name> (LIST) {
+  repeated <element-type> <element-name>;
+}

Review Comment:
   Again, I think that calling attention to the degenerate cases and 
documenting them is only going to cause more confusion. The purpose of this 
originally was to simply document how to interpret data that doesn't match 
expectations. Now this introduces how a 2-level list looks, which I think 
increases the possibility that people will misread this and write them.






Re: [PR] GH-465: Clarify backward-compatibility rules on LIST type [parquet-format]

2024-11-26 Thread via GitHub


rdblue commented on code in PR #466:
URL: https://github.com/apache/parquet-format/pull/466#discussion_r1859237857


##
LogicalTypes.md:
##
@@ -609,9 +609,20 @@ that is neither contained by a `LIST`- or `MAP`-annotated 
group nor annotated
 by `LIST` or `MAP` should be interpreted as a required list of required
 elements where the element type is the type of the field.
 
-Implementations should use either `LIST` and `MAP` annotations _or_ unannotated
-repeated fields, but not both. When using the annotations, no unannotated
-repeated types are allowed.
+```
+// List<Integer> (non-null list, non-null elements)
+repeated int32 num;
+
+// List<Tuple<Integer, String>> (non-null list, non-null elements)
+repeated group my_list {
+  required int32 num;
+  optional binary str (STRING);
+}

Review Comment:
   I think this example is counter-productive. We don't want anyone using 
un-annotated lists and maps. While the paragraph above explains how to 
interpret un-annotated `repeated` fields, I don't want anyone to see an example 
here and think that it is something that should be copied. I think it is 
already clear enough and I would simply move on rather than drawing attention 
to this as a possibility.






Re: [PR] GH-3070: Add Variant logical type annotation to parquet-java [parquet-java]

2024-11-26 Thread via GitHub


wgtmac commented on PR #3072:
URL: https://github.com/apache/parquet-java/pull/3072#issuecomment-2501022328

   Usually we need two reference implementations for spec changes like this. 
I'm not sure if there is any chance to have another implementation ready in a 
timely manner. IMO, at least parquet-java should support basic roundtrip read 
and write.





Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2024-11-26 Thread via GitHub


rdblue commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1859075924


##
VariantEncoding.md:
##
@@ -39,13 +39,41 @@ Another motivation for the representation is that (aside 
from metadata) each nes
 For example, in a Variant containing an Array of Variant values, the 
representation of an inner Variant value, when paired with the metadata of the 
full variant, is itself a valid Variant.
 
 This document describes the Variant Binary Encoding scheme.
-[VariantShredding.md](VariantShredding.md) describes the details of the 
Variant shredding scheme.
+The [Variant Shredding specification](VariantShredding.md) describes the 
details of shredding Variant values as typed Parquet columns.
+
+## Variant in Parquet
 
-# Variant in Parquet
 A Variant value in Parquet is represented by a group with 2 fields, named 
`value` and `metadata`.
-Both fields `value` and `metadata` are of type `binary`, and cannot be `null`.
 
-# Metadata encoding
+* The Variant group must be annotated with the `VARIANT` logical type.
+* Both fields `value` and `metadata` must be of type `binary` (called 
`BYTE_ARRAY` in the Parquet thrift definition).
+* The `metadata` field is required and must be a valid Variant metadata, as 
defined below.
+* The `value` field is required for unshredded Variant values.
+* The `value` field is optional when parts of the Variant value are shredded 
according to the [Variant Shredding specification](VariantShredding.md).

Review Comment:
   I've updated this to make it clear that this refers to the field's 
repetition (`required` or `optional`). There are also examples, so I think it 
is unambiguous.
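
The structural rules in the quoted bullet list can be sketched as a small schema check. This is only an illustration, not code from parquet-java or any other implementation: `Field` and `check_variant_group` are hypothetical names, and the check reads the bullets literally, treating `value` as `optional` exactly when the Variant is shredded. It does not validate the bytes of `metadata` or `value`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Field:
    """Hypothetical stand-in for one Parquet schema field (not a real library type)."""
    name: str
    physical_type: str  # e.g. "BYTE_ARRAY" or "INT64"
    repetition: str     # "required", "optional", or "repeated"


def check_variant_group(annotation: Optional[str], fields: list, shredded: bool) -> list:
    """Return a list of rule violations; an empty list means the group passes."""
    problems = []
    if annotation != "VARIANT":
        problems.append("group must be annotated with VARIANT")

    by_name = {f.name: f for f in fields}
    # Assumption: a literal reading of the bullets, where `value` is `optional`
    # exactly when the Variant is shredded and `required` otherwise.
    expected = {"metadata": "required", "value": "optional" if shredded else "required"}

    for name, repetition in expected.items():
        field = by_name.get(name)
        if field is None:
            problems.append(f"missing field `{name}`")
            continue
        if field.physical_type != "BYTE_ARRAY":
            problems.append(f"`{name}` must be binary (BYTE_ARRAY)")
        if field.repetition != repetition:
            problems.append(f"`{name}` must be {repetition}")
    return problems
```

Calling `check_variant_group("VARIANT", fields, shredded=False)` with a `required` `metadata` and a `required` `value` field returns an empty list under these assumptions.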






Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2024-11-26 Thread via GitHub


rdblue commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1859077883


##
VariantShredding.md:
##
@@ -25,276 +25,302 @@
 The Variant type is designed to store and process semi-structured data 
efficiently, even with heterogeneous values.
 Query engines encode each Variant value in a self-describing format, and store 
it as a group containing `value` and `metadata` binary fields in Parquet.
 Since data is often partially homogenous, it can be beneficial to extract 
certain fields into separate Parquet columns to further improve performance.
-We refer to this process as **shredding**.
-Each Parquet file remains fully self-describing, with no additional metadata 
required to read or fully reconstruct the Variant data from the file.
-Combining shredding with a binary residual provides the flexibility to 
represent complex, evolving data with an unbounded number of unique fields 
while limiting the size of file schemas, and retaining the performance benefits 
of a columnar format.
+This process is **shredding**.
 
-This document focuses on the shredding semantics, Parquet representation, 
implications for readers and writers, as well as the Variant reconstruction.
-For now, it does not discuss which fields to shred, user-facing API changes, 
or any engine-specific considerations like how to use shredded columns.
-The approach builds upon the [Variant Binary Encoding](VariantEncoding.md), 
and leverages the existing Parquet specification.
+Shredding enables the use of Parquet's columnar representation for more 
compact data encoding, the use of column statistics for data skipping, and 
partial projections from Parquet's columnar layout.
 
-At a high level, we replace the `value` field of the Variant Parquet group 
with one or more fields called `object`, `array`, `typed_value`, and 
`variant_value`.
-These represent a fixed schema suitable for constructing the full Variant 
value for each row.
+For example, the query `SELECT variant_get(event, '$.event_ts', 'timestamp') 
FROM tbl` only needs to load field `event_ts`, and shredding can enable 
columnar projection that ignores the rest of the `event` Variant.
+Similarly, for the query `SELECT * FROM tbl WHERE variant_get(event, 
'$.event_type', 'string') = 'signup'`, the `event_type` shredded column 
metadata can be used for skipping and to lazily load the rest of the Variant.
 
-Shredding allows a query engine to reap the full benefits of Parquet's 
columnar representation, such as more compact data encoding, min/max statistics 
for data skipping, and I/O and CPU savings from pruning unnecessary fields not 
accessed by a query (including the non-shredded Variant binary data).
-Without shredding, any query that accesses a Variant column must fetch all 
bytes of the full binary buffer.
-With shredding, we can get nearly equivalent performance as in a relational 
(scalar) data model.
+## Variant Metadata
 
-For example, `select variant_get(variant_col, ‘$.field1.inner_field2’, 
‘string’) from tbl` only needs to access `inner_field2`, and the file scan 
could avoid fetching the rest of the Variant value if this field was shredded 
into a separate column in the Parquet schema.
-Similarly, for the query `select * from tbl where variant_get(variant_col, 
‘$.id’, ‘integer’) = 123`, the scan could first decode the shredded `id` 
column, and only fetch/decode the full Variant value for rows that pass the 
filter.
+Variant metadata is stored in the top-level Variant group in a binary 
`metadata` column regardless of whether the Variant value is shredded.
 
-# Parquet Example
+All `value` columns within the Variant must use the same `metadata`.
+All field names of a Variant, whether shredded or not, must be present in the 
metadata.
 
-Consider the following Parquet schema together with how Variant values might 
be mapped to it.
-Notice that we represent each shredded field in `object` as a group of two 
fields, `typed_value` and `variant_value`.
-We extract all homogenous data items of a certain path into `typed_value`, and 
set aside incompatible data items in `variant_value`.
-Intuitively, incompatibilities within the same path may occur because we store 
the shredding schema per Parquet file, and each file can contain several row 
groups.
-Selecting a type for each field that is acceptable for all rows would be 
impractical because it would require buffering the contents of an entire file 
before writing.
+## Value Shredding
 
-Typically, the expectation is that `variant_value` exists at every level as an 
option, along with one of `object`, `array` or `typed_value`.
-If the actual Variant value contains a type that does not match the provided 
schema, it is stored in `variant_value`.
-An `variant_value` may also be populated if an object can be partially 
represented: any fields that are present in the schema must be written to those 
fields, and any missing fields are written to `variant_value`.
-
-The 

Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2024-11-26 Thread via GitHub


rdblue commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1859080002


##
VariantEncoding.md:
##
@@ -39,13 +39,41 @@ Another motivation for the representation is that (aside 
from metadata) each nes
 For example, in a Variant containing an Array of Variant values, the 
representation of an inner Variant value, when paired with the metadata of the 
full variant, is itself a valid Variant.
 
 This document describes the Variant Binary Encoding scheme.
-[VariantShredding.md](VariantShredding.md) describes the details of the 
Variant shredding scheme.
+The [Variant Shredding specification](VariantShredding.md) describes the 
details of shredding Variant values as typed Parquet columns.

Review Comment:
   Thanks! Updated.






Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2024-11-26 Thread via GitHub


rdblue commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1859083339


##
VariantEncoding.md:
##
@@ -39,13 +39,41 @@ Another motivation for the representation is that (aside 
from metadata) each nes
 For example, in a Variant containing an Array of Variant values, the 
representation of an inner Variant value, when paired with the metadata of the 
full variant, is itself a valid Variant.
 
 This document describes the Variant Binary Encoding scheme.
-[VariantShredding.md](VariantShredding.md) describes the details of the 
Variant shredding scheme.
+The [Variant Shredding specification](VariantShredding.md) describes the 
details of shredding Variant values as typed Parquet columns.
+
+## Variant in Parquet
 
-# Variant in Parquet
 A Variant value in Parquet is represented by a group with 2 fields, named 
`value` and `metadata`.
-Both fields `value` and `metadata` are of type `binary`, and cannot be `null`.
 
-# Metadata encoding
+* The Variant group must be annotated with the `VARIANT` logical type.
+* Both fields `value` and `metadata` must be of type `binary` (called 
`BYTE_ARRAY` in the Parquet thrift definition).
+* The `metadata` field is required and must be a valid Variant metadata, as 
defined below.
+* The `value` field is required for unshredded Variant values.
+* The `value` field is optional when parts of the Variant value are shredded 
according to the [Variant Shredding specification](VariantShredding.md).
+* When present, the `value` field must be a valid Variant value, as defined 
below. 
+
+This is the expected unshredded representation in Parquet:
+
+```
+optional group variant_name (VARIANT) {
+  required binary metadata;
+  required binary value;
+}
+```
+
+This is an example representation of a shredded Variant in Parquet:

Review Comment:
   This already points to the shredding spec in multiple places, so I think it 
is clear how to get more information about `typed_value`.






Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2024-11-26 Thread via GitHub


rdblue commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1859084567


##
VariantEncoding.md:
##
@@ -39,13 +39,41 @@ Another motivation for the representation is that (aside 
from metadata) each nes
 For example, in a Variant containing an Array of Variant values, the 
representation of an inner Variant value, when paired with the metadata of the 
full variant, is itself a valid Variant.
 
 This document describes the Variant Binary Encoding scheme.
-[VariantShredding.md](VariantShredding.md) describes the details of the 
Variant shredding scheme.
+The [Variant Shredding specification](VariantShredding.md) describes the 
details of shredding Variant values as typed Parquet columns.
+
+## Variant in Parquet
 
-# Variant in Parquet
 A Variant value in Parquet is represented by a group with 2 fields, named 
`value` and `metadata`.
-Both fields `value` and `metadata` are of type `binary`, and cannot be `null`.
 
-# Metadata encoding
+* The Variant group must be annotated with the `VARIANT` logical type.
+* Both fields `value` and `metadata` must be of type `binary` (called 
`BYTE_ARRAY` in the Parquet thrift definition).
+* The `metadata` field is required and must be a valid Variant metadata, as 
defined below.
+* The `value` field is required for unshredded Variant values.
+* The `value` field is optional when parts of the Variant value are shredded 
according to the [Variant Shredding specification](VariantShredding.md).
+* When present, the `value` field must be a valid Variant value, as defined 
below. 
+
+This is the expected unshredded representation in Parquet:
+
+```
+optional group variant_name (VARIANT) {
+  required binary metadata;
+  required binary value;
+}
+```
+
+This is an example representation of a shredded Variant in Parquet:
+```
+optional group shredded_variant_name (VARIANT) {
+  required binary metadata;
+  optional binary value;
+  optional int64 typed_value;
+}
+```
+
+The `VARIANT` annotation places no additional restrictions on the repetition 
of Variant groups, but repetition may be restricted by containing types (such 
as `MAP` and `LIST`).

Review Comment:
   I don't agree that it is considered a primitive type. And we don't need to 
in order to state that it places no additional restrictions on the repetition 
of Variant groups.






Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2024-11-26 Thread via GitHub


rdblue commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1859086423


##
VariantShredding.md:
##
@@ -25,290 +25,316 @@
 The Variant type is designed to store and process semi-structured data 
efficiently, even with heterogeneous values.
 Query engines encode each Variant value in a self-describing format, and store 
it as a group containing `value` and `metadata` binary fields in Parquet.
 Since data is often partially homogenous, it can be beneficial to extract 
certain fields into separate Parquet columns to further improve performance.
-We refer to this process as **shredding**.
-Each Parquet file remains fully self-describing, with no additional metadata 
required to read or fully reconstruct the Variant data from the file.
-Combining shredding with a binary residual provides the flexibility to 
represent complex, evolving data with an unbounded number of unique fields 
while limiting the size of file schemas, and retaining the performance benefits 
of a columnar format.
+This process is **shredding**.
 
-This document focuses on the shredding semantics, Parquet representation, 
implications for readers and writers, as well as the Variant reconstruction.
-For now, it does not discuss which fields to shred, user-facing API changes, 
or any engine-specific considerations like how to use shredded columns.
-The approach builds upon the [Variant Binary Encoding](VariantEncoding.md), 
and leverages the existing Parquet specification.
+Shredding enables the use of Parquet's columnar representation for more 
compact data encoding, column statistics for data skipping, and partial 
projections.

Review Comment:
   I think JSON makes it more confusing because these objects are not JSON and 
contain typed values.
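
For the "column statistics for data skipping" point in the quoted text, a rough sketch of min/max pruning on a shredded string column might look like the following. `ColumnStats` and `row_groups_to_read` are hypothetical names, not part of any Parquet library, and the sketch deliberately ignores rows whose field falls back to the binary `value` column; a real reader would have to account for those before skipping a row group.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnStats:
    """Hypothetical min/max statistics for one shredded column in one row group."""
    min_value: str
    max_value: str


def may_contain(stats: ColumnStats, target: str) -> bool:
    """An equality predicate can only match if target lies within [min, max]."""
    return stats.min_value <= target <= stats.max_value


def row_groups_to_read(stats_per_row_group: list, target: str) -> list:
    """Return the indexes of row groups that cannot be skipped for `col = target`."""
    return [i for i, s in enumerate(stats_per_row_group) if may_contain(s, target)]
```

A row group that survives this filter may still contain no matching rows; the statistics only rule row groups out, they never confirm a match.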






Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2024-11-26 Thread via GitHub


rdblue commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1859087543


##
VariantShredding.md:
##
@@ -25,290 +25,316 @@
 The Variant type is designed to store and process semi-structured data 
efficiently, even with heterogeneous values.
 Query engines encode each Variant value in a self-describing format, and store 
it as a group containing `value` and `metadata` binary fields in Parquet.
 Since data is often partially homogenous, it can be beneficial to extract 
certain fields into separate Parquet columns to further improve performance.
-We refer to this process as **shredding**.
-Each Parquet file remains fully self-describing, with no additional metadata 
required to read or fully reconstruct the Variant data from the file.
-Combining shredding with a binary residual provides the flexibility to 
represent complex, evolving data with an unbounded number of unique fields 
while limiting the size of file schemas, and retaining the performance benefits 
of a columnar format.
+This process is **shredding**.
 
-This document focuses on the shredding semantics, Parquet representation, 
implications for readers and writers, as well as the Variant reconstruction.
-For now, it does not discuss which fields to shred, user-facing API changes, 
or any engine-specific considerations like how to use shredded columns.
-The approach builds upon the [Variant Binary Encoding](VariantEncoding.md), 
and leverages the existing Parquet specification.
+Shredding enables the use of Parquet's columnar representation for more 
compact data encoding, column statistics for data skipping, and partial 
projections.
 
-At a high level, we replace the `value` field of the Variant Parquet group 
with one or more fields called `object`, `array`, `typed_value`, and 
`variant_value`.
-These represent a fixed schema suitable for constructing the full Variant 
value for each row.
+For example, the query `SELECT variant_get(event, '$.event_ts', 'timestamp') 
FROM tbl` only needs to load field `event_ts`, and if that column is shredded, 
it can be read by columnar projection without reading or deserializing the rest 
of the `event` Variant.
+Similarly, for the query `SELECT * FROM tbl WHERE variant_get(event, 
'$.event_type', 'string') = 'signup'`, the `event_type` shredded column 
metadata can be used for skipping and to lazily load the rest of the Variant.
 
-Shredding allows a query engine to reap the full benefits of Parquet's 
columnar representation, such as more compact data encoding, min/max statistics 
for data skipping, and I/O and CPU savings from pruning unnecessary fields not 
accessed by a query (including the non-shredded Variant binary data).
-Without shredding, any query that accesses a Variant column must fetch all 
bytes of the full binary buffer.
-With shredding, we can get nearly equivalent performance as in a relational 
(scalar) data model.
+## Variant Metadata
 
-For example, `select variant_get(variant_col, ‘$.field1.inner_field2’, 
‘string’) from tbl` only needs to access `inner_field2`, and the file scan 
could avoid fetching the rest of the Variant value if this field was shredded 
into a separate column in the Parquet schema.
-Similarly, for the query `select * from tbl where variant_get(variant_col, 
‘$.id’, ‘integer’) = 123`, the scan could first decode the shredded `id` 
column, and only fetch/decode the full Variant value for rows that pass the 
filter.
+Variant metadata is stored in the top-level Variant group in a binary 
`metadata` column regardless of whether the Variant value is shredded.
 
-# Parquet Example
+All `value` columns within the Variant must use the same `metadata`.
+All field names of a Variant, whether shredded or not, must be present in the 
metadata.
 
-Consider the following Parquet schema together with how Variant values might 
be mapped to it.
-Notice that we represent each shredded field in `object` as a group of two 
fields, `typed_value` and `variant_value`.
-We extract all homogenous data items of a certain path into `typed_value`, and 
set aside incompatible data items in `variant_value`.
-Intuitively, incompatibilities within the same path may occur because we store 
the shredding schema per Parquet file, and each file can contain several row 
groups.
-Selecting a type for each field that is acceptable for all rows would be 
impractical because it would require buffering the contents of an entire file 
before writing.
+## Value Shredding
 
-Typically, the expectation is that `variant_value` exists at every level as an 
option, along with one of `object`, `array` or `typed_value`.
-If the actual Variant value contains a type that does not match the provided 
schema, it is stored in `variant_value`.
-An `variant_value` may also be populated if an object can be partially 
represented: any fields that are present in the schema must be written to those 
fields, and any missing fields are written to `variant_value`.
-
-The

Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2024-11-26 Thread via GitHub


rdblue commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1859093933


##
VariantShredding.md:
##
@@ -25,290 +25,316 @@
 The Variant type is designed to store and process semi-structured data 
efficiently, even with heterogeneous values.
 Query engines encode each Variant value in a self-describing format, and store 
it as a group containing `value` and `metadata` binary fields in Parquet.
 Since data is often partially homogenous, it can be beneficial to extract 
certain fields into separate Parquet columns to further improve performance.
-We refer to this process as **shredding**.
-Each Parquet file remains fully self-describing, with no additional metadata 
required to read or fully reconstruct the Variant data from the file.
-Combining shredding with a binary residual provides the flexibility to 
represent complex, evolving data with an unbounded number of unique fields 
while limiting the size of file schemas, and retaining the performance benefits 
of a columnar format.
+This process is **shredding**.
 
-This document focuses on the shredding semantics, Parquet representation, 
implications for readers and writers, as well as the Variant reconstruction.
-For now, it does not discuss which fields to shred, user-facing API changes, 
or any engine-specific considerations like how to use shredded columns.
-The approach builds upon the [Variant Binary Encoding](VariantEncoding.md), 
and leverages the existing Parquet specification.
+Shredding enables the use of Parquet's columnar representation for more 
compact data encoding, column statistics for data skipping, and partial 
projections.
 
-At a high level, we replace the `value` field of the Variant Parquet group 
with one or more fields called `object`, `array`, `typed_value`, and 
`variant_value`.
-These represent a fixed schema suitable for constructing the full Variant 
value for each row.
+For example, the query `SELECT variant_get(event, '$.event_ts', 'timestamp') 
FROM tbl` only needs to load field `event_ts`, and if that column is shredded, 
it can be read by columnar projection without reading or deserializing the rest 
of the `event` Variant.
+Similarly, for the query `SELECT * FROM tbl WHERE variant_get(event, 
'$.event_type', 'string') = 'signup'`, the `event_type` shredded column 
metadata can be used for skipping and to lazily load the rest of the Variant.
 
-Shredding allows a query engine to reap the full benefits of Parquet's 
columnar representation, such as more compact data encoding, min/max statistics 
for data skipping, and I/O and CPU savings from pruning unnecessary fields not 
accessed by a query (including the non-shredded Variant binary data).
-Without shredding, any query that accesses a Variant column must fetch all 
bytes of the full binary buffer.
-With shredding, we can get nearly equivalent performance as in a relational 
(scalar) data model.
+## Variant Metadata
 
-For example, `select variant_get(variant_col, ‘$.field1.inner_field2’, 
‘string’) from tbl` only needs to access `inner_field2`, and the file scan 
could avoid fetching the rest of the Variant value if this field was shredded 
into a separate column in the Parquet schema.
-Similarly, for the query `select * from tbl where variant_get(variant_col, 
‘$.id’, ‘integer’) = 123`, the scan could first decode the shredded `id` 
column, and only fetch/decode the full Variant value for rows that pass the 
filter.
+Variant metadata is stored in the top-level Variant group in a binary 
`metadata` column regardless of whether the Variant value is shredded.
 
-# Parquet Example
+All `value` columns within the Variant must use the same `metadata`.
+All field names of a Variant, whether shredded or not, must be present in the 
metadata.
 
-Consider the following Parquet schema together with how Variant values might 
be mapped to it.
-Notice that we represent each shredded field in `object` as a group of two 
fields, `typed_value` and `variant_value`.
-We extract all homogenous data items of a certain path into `typed_value`, and 
set aside incompatible data items in `variant_value`.
-Intuitively, incompatibilities within the same path may occur because we store 
the shredding schema per Parquet file, and each file can contain several row 
groups.
-Selecting a type for each field that is acceptable for all rows would be 
impractical because it would require buffering the contents of an entire file 
before writing.
+## Value Shredding
 
-Typically, the expectation is that `variant_value` exists at every level as an 
option, along with one of `object`, `array` or `typed_value`.
-If the actual Variant value contains a type that does not match the provided 
schema, it is stored in `variant_value`.
-An `variant_value` may also be populated if an object can be partially 
represented: any fields that are present in the schema must be written to those 
fields, and any missing fields are written to `variant_value`.
-
-The

Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2024-11-26 Thread via GitHub


rdblue commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1859095957


##
VariantShredding.md:
##
@@ -25,290 +25,316 @@
 The Variant type is designed to store and process semi-structured data 
efficiently, even with heterogeneous values.
 Query engines encode each Variant value in a self-describing format, and store 
it as a group containing `value` and `metadata` binary fields in Parquet.
 Since data is often partially homogenous, it can be beneficial to extract 
certain fields into separate Parquet columns to further improve performance.
-We refer to this process as **shredding**.
-Each Parquet file remains fully self-describing, with no additional metadata 
required to read or fully reconstruct the Variant data from the file.
-Combining shredding with a binary residual provides the flexibility to 
represent complex, evolving data with an unbounded number of unique fields 
while limiting the size of file schemas, and retaining the performance benefits 
of a columnar format.
+This process is **shredding**.
 
-This document focuses on the shredding semantics, Parquet representation, 
implications for readers and writers, as well as the Variant reconstruction.
-For now, it does not discuss which fields to shred, user-facing API changes, 
or any engine-specific considerations like how to use shredded columns.
-The approach builds upon the [Variant Binary Encoding](VariantEncoding.md), 
and leverages the existing Parquet specification.
+Shredding enables the use of Parquet's columnar representation for more 
compact data encoding, column statistics for data skipping, and partial 
projections.
 
-At a high level, we replace the `value` field of the Variant Parquet group 
with one or more fields called `object`, `array`, `typed_value`, and 
`variant_value`.
-These represent a fixed schema suitable for constructing the full Variant 
value for each row.
+For example, the query `SELECT variant_get(event, '$.event_ts', 'timestamp') 
FROM tbl` only needs to load field `event_ts`, and if that column is shredded, 
it can be read by columnar projection without reading or deserializing the rest 
of the `event` Variant.
+Similarly, for the query `SELECT * FROM tbl WHERE variant_get(event, 
'$.event_type', 'string') = 'signup'`, the `event_type` shredded column 
metadata can be used for skipping and to lazily load the rest of the Variant.
 
-Shredding allows a query engine to reap the full benefits of Parquet's 
columnar representation, such as more compact data encoding, min/max statistics 
for data skipping, and I/O and CPU savings from pruning unnecessary fields not 
accessed by a query (including the non-shredded Variant binary data).
-Without shredding, any query that accesses a Variant column must fetch all 
bytes of the full binary buffer.
-With shredding, we can get nearly equivalent performance as in a relational 
(scalar) data model.
+## Variant Metadata
 
-For example, `select variant_get(variant_col, ‘$.field1.inner_field2’, 
‘string’) from tbl` only needs to access `inner_field2`, and the file scan 
could avoid fetching the rest of the Variant value if this field was shredded 
into a separate column in the Parquet schema.
-Similarly, for the query `select * from tbl where variant_get(variant_col, 
‘$.id’, ‘integer’) = 123`, the scan could first decode the shredded `id` 
column, and only fetch/decode the full Variant value for rows that pass the 
filter.
+Variant metadata is stored in the top-level Variant group in a binary 
`metadata` column regardless of whether the Variant value is shredded.
 
-# Parquet Example
+All `value` columns within the Variant must use the same `metadata`.
+All field names of a Variant, whether shredded or not, must be present in the 
metadata.
 
-Consider the following Parquet schema together with how Variant values might 
be mapped to it.
-Notice that we represent each shredded field in `object` as a group of two 
fields, `typed_value` and `variant_value`.
-We extract all homogenous data items of a certain path into `typed_value`, and 
set aside incompatible data items in `variant_value`.
-Intuitively, incompatibilities within the same path may occur because we store 
the shredding schema per Parquet file, and each file can contain several row 
groups.
-Selecting a type for each field that is acceptable for all rows would be 
impractical because it would require buffering the contents of an entire file 
before writing.
+## Value Shredding
 
-Typically, the expectation is that `variant_value` exists at every level as an 
option, along with one of `object`, `array` or `typed_value`.
-If the actual Variant value contains a type that does not match the provided 
schema, it is stored in `variant_value`.
-An `variant_value` may also be populated if an object can be partially 
represented: any fields that are present in the schema must be written to those 
fields, and any missing fields are written to `variant_value`.
-
-The

Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2024-11-26 Thread via GitHub


rdblue commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1859099929


##
VariantShredding.md:
##

Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2024-11-26 Thread via GitHub


rdblue commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1859092517


##
VariantShredding.md:
##

Re: [PR] MINOR: Use `exec-maven-plugin.version` property [parquet-java]

2024-11-26 Thread via GitHub


Fokko merged PR #3047:
URL: https://github.com/apache/parquet-java/pull/3047





Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2024-11-26 Thread via GitHub


rdblue commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1859141628


##
VariantShredding.md:
##

Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2024-11-26 Thread via GitHub


rdblue commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1859148222


##
VariantShredding.md:
##

[PR] MINOR: Revert `buildnumber-maven-plugin` to 3.2.0 [parquet-java]

2024-11-26 Thread via GitHub


Fokko opened a new pull request, #3082:
URL: https://github.com/apache/parquet-java/pull/3082

   ### Rationale for this change
   
   During verification of the 1.15.0 release, @gszadovszky noticed that the 
current `buildnumber-maven-plugin` version caused issues, so it is better to revert the plugin to 3.2.0 for now.
   





Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2024-11-26 Thread via GitHub


rdblue commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1859127325


##
VariantShredding.md:
##

Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2024-11-26 Thread via GitHub


rdblue commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1859130304


##
VariantShredding.md:
##
@@ -25,290 +25,316 @@
 The Variant type is designed to store and process semi-structured data 
efficiently, even with heterogeneous values.
 Query engines encode each Variant value in a self-describing format, and store 
it as a group containing `value` and `metadata` binary fields in Parquet.
 Since data is often partially homogenous, it can be beneficial to extract 
certain fields into separate Parquet columns to further improve performance.
-We refer to this process as **shredding**.
-Each Parquet file remains fully self-describing, with no additional metadata 
required to read or fully reconstruct the Variant data from the file.
-Combining shredding with a binary residual provides the flexibility to 
represent complex, evolving data with an unbounded number of unique fields 
while limiting the size of file schemas, and retaining the performance benefits 
of a columnar format.
+This process is **shredding**.
 
-This document focuses on the shredding semantics, Parquet representation, 
implications for readers and writers, as well as the Variant reconstruction.
-For now, it does not discuss which fields to shred, user-facing API changes, 
or any engine-specific considerations like how to use shredded columns.
-The approach builds upon the [Variant Binary Encoding](VariantEncoding.md), 
and leverages the existing Parquet specification.
+Shredding enables the use of Parquet's columnar representation for more 
compact data encoding, column statistics for data skipping, and partial 
projections.
 
-At a high level, we replace the `value` field of the Variant Parquet group 
with one or more fields called `object`, `array`, `typed_value`, and 
`variant_value`.
-These represent a fixed schema suitable for constructing the full Variant 
value for each row.
+For example, the query `SELECT variant_get(event, '$.event_ts', 'timestamp') 
FROM tbl` only needs to load field `event_ts`, and if that column is shredded, 
it can be read by columnar projection without reading or deserializing the rest 
of the `event` Variant.
+Similarly, for the query `SELECT * FROM tbl WHERE variant_get(event, 
'$.event_type', 'string') = 'signup'`, the `event_type` shredded column 
metadata can be used for skipping and to lazily load the rest of the Variant.
 
-Shredding allows a query engine to reap the full benefits of Parquet's 
columnar representation, such as more compact data encoding, min/max statistics 
for data skipping, and I/O and CPU savings from pruning unnecessary fields not 
accessed by a query (including the non-shredded Variant binary data).
-Without shredding, any query that accesses a Variant column must fetch all 
bytes of the full binary buffer.
-With shredding, we can get nearly equivalent performance as in a relational 
(scalar) data model.
+## Variant Metadata
 
-For example, `select variant_get(variant_col, ‘$.field1.inner_field2’, 
‘string’) from tbl` only needs to access `inner_field2`, and the file scan 
could avoid fetching the rest of the Variant value if this field was shredded 
into a separate column in the Parquet schema.
-Similarly, for the query `select * from tbl where variant_get(variant_col, 
‘$.id’, ‘integer’) = 123`, the scan could first decode the shredded `id` 
column, and only fetch/decode the full Variant value for rows that pass the 
filter.
+Variant metadata is stored in the top-level Variant group in a binary 
`metadata` column regardless of whether the Variant value is shredded.
 
-# Parquet Example
+All `value` columns within the Variant must use the same `metadata`.
+All field names of a Variant, whether shredded or not, must be present in the 
metadata.
 
-Consider the following Parquet schema together with how Variant values might 
be mapped to it.
-Notice that we represent each shredded field in `object` as a group of two 
fields, `typed_value` and `variant_value`.
-We extract all homogenous data items of a certain path into `typed_value`, and 
set aside incompatible data items in `variant_value`.
-Intuitively, incompatibilities within the same path may occur because we store 
the shredding schema per Parquet file, and each file can contain several row 
groups.
-Selecting a type for each field that is acceptable for all rows would be 
impractical because it would require buffering the contents of an entire file 
before writing.
+## Value Shredding
 
-Typically, the expectation is that `variant_value` exists at every level as an 
option, along with one of `object`, `array` or `typed_value`.
-If the actual Variant value contains a type that does not match the provided 
schema, it is stored in `variant_value`.
-An `variant_value` may also be populated if an object can be partially 
represented: any fields that are present in the schema must be written to those 
fields, and any missing fields are written to `variant_value`.
-
-The

Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2024-11-26 Thread via GitHub


rdblue commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1859147187


##
VariantShredding.md:
##

Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2024-11-26 Thread via GitHub


rdblue commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1859151894


##
VariantEncoding.md:
##
@@ -416,14 +444,36 @@ Field names are case-sensitive.
 Field names are required to be unique for each object.
 It is an error for an object to contain two fields with the same name, whether 
or not they have distinct dictionary IDs.
 
-# Versions and extensions
+## Versions and extensions
 
 An implementation is not expected to parse a Variant value whose metadata 
version is higher than the version supported by the implementation.
 However, new types may be added to the specification without incrementing the 
version ID.
 In such a situation, an implementation should be able to read the rest of the 
Variant value if desired.
 
-# Shredding
+## Shredding
 
 A single Variant object may have poor read performance when only a small 
subset of fields are needed.
 A better approach is to create separate columns for individual fields, 
referred to as shredding or subcolumnarization.
 [VariantShredding.md](VariantShredding.md) describes the Variant shredding 
specification in Parquet.
+
+## Conversion to JSON
+
+Values stored in the Variant encoding are a superset of JSON values.
+For example, a Variant value can be a date that has no equivalent type in JSON.
+To maximize compatibility with readers that can process JSON but not Variant, 
the following conversions should be used when producing JSON from a Variant:
+
+| Variant type  | JSON type | Representation requirements                      | Example     |
+|---------------|-----------|--------------------------------------------------|-------------|
+| Null type     | null      | `null`                                           | `null`      |
+| Boolean       | boolean   | `true` or `false`                                | `true`      |
+| Exact Numeric | number    | Digits in fraction must match scale, no exponent | `34`, 34.00 |

Review Comment:
   > When an engine wants to convert a variant value to a JSON string, here are 
the rules
   
   Yes, this is correct. We want a clear way to convert to a JSON string. 
However, the normalization does not need to happen first. We don't want to 
specify that the JSON must be any more lossy than it already is.
   
   Why would we require an engine to produce a normalized value?
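
   To illustrate the Exact Numeric row quoted above ("Digits in fraction must 
match scale, no exponent"), here is a minimal Python sketch. It assumes the 
decimal arrives as an unscaled integer plus a non-negative scale; the function 
name `exact_numeric_to_json` is hypothetical and not part of any Parquet 
library.

```python
def exact_numeric_to_json(unscaled: int, scale: int) -> str:
    """Render unscaled * 10**-scale as JSON number text.

    The output has exactly `scale` fraction digits and never uses an exponent.
    Assumption: the value is given as (unscaled integer, non-negative scale).
    """
    if scale < 0:
        raise ValueError("scale must be non-negative")
    sign = "-" if unscaled < 0 else ""
    # Left-pad so there is at least one digit before the decimal point.
    digits = str(abs(unscaled)).rjust(scale + 1, "0")
    if scale == 0:
        return sign + digits
    return f"{sign}{digits[:-scale]}.{digits[-scale:]}"


print(exact_numeric_to_json(34, 0))    # 34
print(exact_numeric_to_json(3400, 2))  # 34.00
```

   Working on the integer digits rather than on a float keeps trailing zeros, 
so a scale of 2 produces `34.00` and not `34`.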






[I] HadoopStreams to support ByteBufferPositionedReadable input streams [parquet-java]

2024-11-26 Thread via GitHub


steveloughran opened a new issue, #3080:
URL: https://github.com/apache/parquet-java/issues/3080

   ### Describe the enhancement requested
   
   
   If a stream declares in its StreamCapabilities that it supports
   ByteBufferPositionedReadable, then use it for `readFully(ByteBuffer)`.
   All streams in Hadoop 3.3.0+ do declare this.
   
   + use StreamCapabilities to look for `ByteBufferReadable`.
   
   For detecting ByteBufferReadable, use this probe, falling back to the
   recursive scan.
   All streams in the Hadoop codebase will report this via StreamCapabilities,
   but there may be some third-party streams which do not.
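
   The probe-then-fallback order described above could look roughly like the 
sketch below. It is written in Python purely as a model: the `Stream` class, 
`recursive_scan`, `supports_byte_buffer_read`, and the capability string are 
hypothetical stand-ins, not Hadoop or parquet-java APIs.

```python
# Hypothetical model of the detection order; none of these names are Hadoop APIs.
READ_BYTE_BUFFER = "in:readbytebuffer"  # assumed capability name, for illustration


class Stream:
    """A stream that may wrap another stream and may declare capabilities."""

    def __init__(self, capabilities=None, wrapped=None, byte_buffer_readable=False):
        self.capabilities = capabilities  # None models a stream that declares nothing
        self.wrapped = wrapped
        self.byte_buffer_readable = byte_buffer_readable

    def has_capability(self, name):
        return self.capabilities is not None and name in self.capabilities


def recursive_scan(stream):
    """Fallback: walk the chain of wrapped streams down to the innermost one."""
    while stream.wrapped is not None:
        stream = stream.wrapped
    return stream.byte_buffer_readable


def supports_byte_buffer_read(stream):
    """Trust a declared capability first; otherwise fall back to the scan."""
    if stream.has_capability(READ_BYTE_BUFFER):
        return True
    return recursive_scan(stream)
```

   A stream that declares the capability is accepted without unwrapping, while 
a third-party wrapper that declares nothing is still handled by the scan.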
   
   
   
   ### Component(s)
   
   _No response_





Re: [I] HadoopStreams to support ByteBufferPositionedReadable input streams [parquet-java]

2024-11-26 Thread via GitHub


steveloughran commented on issue #3080:
URL: https://github.com/apache/parquet-java/issues/3080#issuecomment-2501825209

   I'm implementing this, with tests.





Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2024-11-26 Thread via GitHub


rdblue commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1859139239


##
VariantShredding.md:
##

Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2024-11-26 Thread via GitHub


rdblue commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1859143649


##
VariantShredding.md:
##

[PR] MINOR: Add shading for JDK22 specific classes [parquet-java]

2024-11-26 Thread via GitHub


Fokko opened a new pull request, #3081:
URL: https://github.com/apache/parquet-java/pull/3081

   ### Rationale for this change
   
   JDK 22 specific classes were added in Jackson, but we forgot to shade them 
explicitly as pointed out in:
   
   
https://github.com/apache/parquet-java/blob/8fa70320a9cdeeba12a4d17ef248cd4e535f0907/pom.xml#L70
   
   ### What changes are included in this PR?
   
   
   ### Are these changes tested?
   
   
   ### Are there any user-facing changes?
   
   
   
   
   





Re: [PR] GH-3070: Add Variant logical type annotation to parquet-java [parquet-java]

2024-11-26 Thread via GitHub


aihuaxu commented on PR #3072:
URL: https://github.com/apache/parquet-java/pull/3072#issuecomment-2501372540

   I see. Per the guideline, we need an implementation in parquet-java and 
then a second one. Do we usually include the implementation with this 
annotation change, or should it be separate?
   
   > Completeness: The goal of this phase is to ensure the feature is viable, 
there is no ambiguity in its specification by demonstrating compatibility 
between implementations. Once a change has lazy consensus, two implementations 
of the feature demonstrating interoperability must also be provided. One 
implementation MUST be [parquet-java](http://github.com/apache/parquet-java). 
It is preferred that the second implementation be 
[parquet-cpp](https://github.com/apache/arrow) or 
[parquet-rs](https://github.com/apache/arrow-rs),





[PR] GH-3078: Use Hadoop FileSystem.openFile() to open files [parquet-java]

2024-11-26 Thread via GitHub


steveloughran opened a new pull request, #3079:
URL: https://github.com/apache/parquet-java/pull/3079

   
   ### Rationale for this change
   
   
   ### What changes are included in this PR?
   
   
   * Open files with FileSystem.openFile(), passing in file status
   * And read policy of "parquet, vector, random, adaptive"
   
   ### Are these changes tested?
   
   Through parquet-hadoop.
   
   ### Are there any user-facing changes?
   
   no.
   
   
   Closes #3078 
   





Re: [PR] GH-2943: Remove hadoop-2 support [parquet-java]

2024-11-26 Thread via GitHub


Fokko merged PR #3061:
URL: https://github.com/apache/parquet-java/pull/3061





Re: [I] Remove support for Hadoop <3.3 [parquet-java]

2024-11-26 Thread via GitHub


Fokko closed issue #2943: Remove support for Hadoop <3.3
URL: https://github.com/apache/parquet-java/issues/2943





Re: [PR] HadoopInputFile to pass down FileStatus when opening file [parquet-java]

2024-11-26 Thread via GitHub


steveloughran closed pull request #2955: HadoopInputFile to pass down 
FileStatus when opening file
URL: https://github.com/apache/parquet-java/pull/2955





Re: [PR] HadoopInputFile to pass down FileStatus when opening file [parquet-java]

2024-11-26 Thread via GitHub


steveloughran commented on PR #2955:
URL: https://github.com/apache/parquet-java/pull/2955#issuecomment-2501251041

   Superseded by #3079, now that reflection is not needed.





Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2024-11-26 Thread via GitHub


rdblue commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1859059592


##
VariantShredding.md:
##
@@ -25,276 +25,302 @@
 The Variant type is designed to store and process semi-structured data 
efficiently, even with heterogeneous values.
 Query engines encode each Variant value in a self-describing format, and store 
it as a group containing `value` and `metadata` binary fields in Parquet.
 Since data is often partially homogenous, it can be beneficial to extract 
certain fields into separate Parquet columns to further improve performance.
-We refer to this process as **shredding**.
-Each Parquet file remains fully self-describing, with no additional metadata 
required to read or fully reconstruct the Variant data from the file.
-Combining shredding with a binary residual provides the flexibility to 
represent complex, evolving data with an unbounded number of unique fields 
while limiting the size of file schemas, and retaining the performance benefits 
of a columnar format.
+This process is **shredding**.
 
-This document focuses on the shredding semantics, Parquet representation, 
implications for readers and writers, as well as the Variant reconstruction.
-For now, it does not discuss which fields to shred, user-facing API changes, 
or any engine-specific considerations like how to use shredded columns.
-The approach builds upon the [Variant Binary Encoding](VariantEncoding.md), 
and leverages the existing Parquet specification.
+Shredding enables the use of of Parquet's columnar representation for more 
compact data encoding, the use of column statistics for data skipping, and 
partial projections from Parquet's columnar layout.
 
-At a high level, we replace the `value` field of the Variant Parquet group 
with one or more fields called `object`, `array`, `typed_value`, and 
`variant_value`.
-These represent a fixed schema suitable for constructing the full Variant 
value for each row.
+For example, the query `SELECT variant_get(event, '$.event_ts', 'timestamp') 
FROM tbl` only needs to load field `event_ts`, and shredding can enable 
columnar projection that ignores the rest of the `event` Variant.
+Similarly, for the query `SELECT * FROM tbl WHERE variant_get(event, 
'$.event_type', 'string') = 'signup'`, the `event_type` shredded column 
metadata can be used for skipping and to lazily load the rest of the Variant.
 
-Shredding allows a query engine to reap the full benefits of Parquet's 
columnar representation, such as more compact data encoding, min/max statistics 
for data skipping, and I/O and CPU savings from pruning unnecessary fields not 
accessed by a query (including the non-shredded Variant binary data).
-Without shredding, any query that accesses a Variant column must fetch all 
bytes of the full binary buffer.
-With shredding, we can get nearly equivalent performance as in a relational 
(scalar) data model.
+## Variant Metadata
 
-For example, `select variant_get(variant_col, ‘$.field1.inner_field2’, 
‘string’) from tbl` only needs to access `inner_field2`, and the file scan 
could avoid fetching the rest of the Variant value if this field was shredded 
into a separate column in the Parquet schema.
-Similarly, for the query `select * from tbl where variant_get(variant_col, 
‘$.id’, ‘integer’) = 123`, the scan could first decode the shredded `id` 
column, and only fetch/decode the full Variant value for rows that pass the 
filter.
+Variant metadata is stored in the top-level Variant group in a binary 
`metadata` column regardless of whether the Variant value is shredded.
 
-# Parquet Example
+All `value` columns within the Variant must use the same `metadata`.
+All field names of a Variant, whether shredded or not, must be present in the 
metadata.
 
-Consider the following Parquet schema together with how Variant values might 
be mapped to it.
-Notice that we represent each shredded field in `object` as a group of two 
fields, `typed_value` and `variant_value`.
-We extract all homogenous data items of a certain path into `typed_value`, and 
set aside incompatible data items in `variant_value`.
-Intuitively, incompatibilities within the same path may occur because we store 
the shredding schema per Parquet file, and each file can contain several row 
groups.
-Selecting a type for each field that is acceptable for all rows would be 
impractical because it would require buffering the contents of an entire file 
before writing.
+## Value Shredding
 
-Typically, the expectation is that `variant_value` exists at every level as an 
option, along with one of `object`, `array` or `typed_value`.
-If the actual Variant value contains a type that does not match the provided 
schema, it is stored in `variant_value`.
-An `variant_value` may also be populated if an object can be partially 
represented: any fields that are present in the schema must be written to those 
fields, and any missing fields are written to `variant_value`.
-
-The 

Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2024-11-26 Thread via GitHub


rdblue commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1859061998


##
VariantEncoding.md:
##
@@ -416,14 +444,36 @@ Field names are case-sensitive.
 Field names are required to be unique for each object.
 It is an error for an object to contain two fields with the same name, whether 
or not they have distinct dictionary IDs.
 
-# Versions and extensions
+## Versions and extensions
 
 An implementation is not expected to parse a Variant value whose metadata 
version is higher than the version supported by the implementation.
 However, new types may be added to the specification without incrementing the 
version ID.
 In such a situation, an implementation should be able to read the rest of the 
Variant value if desired.
 
-# Shredding
+## Shredding
 
 A single Variant object may have poor read performance when only a small 
subset of fields are needed.
 A better approach is to create separate columns for individual fields, 
referred to as shredding or subcolumnarization.
 [VariantShredding.md](VariantShredding.md) describes the Variant shredding 
specification in Parquet.
+
+## Conversion to JSON
+
+Values stored in the Variant encoding are a superset of JSON values.
+For example, a Variant value can be a date that has no equivalent type in JSON.
+To maximize compatibility with readers that can process JSON but not Variant, 
the following conversions should be used when producing JSON from a Variant:
+
+| Variant type  | JSON type | Representation requirements | Example |
+|---------------|-----------|-----------------------------|---------|
+| Null type     | null      | `null` | `null` |
+| Boolean       | boolean   | `true` or `false` | `true` |
+| Exact Numeric | number    | Digits in fraction must match scale, no exponent | `34`, 34.00 |
+| Float         | number    | Fraction must be present | `14.20` |
+| Double        | number    | Fraction must be present | `1.0` |
+| Date          | string    | ISO-8601 formatted date | `"2017-11-16"` |
+| Timestamp     | string    | ISO-8601 formatted UTC timestamp including +00:00 offset | `"2017-11-16T22:31:08.01+00:00"` |
+| TimestampNTZ  | string    | ISO-8601 formatted UTC timestamp with no offset or zone | `"2017-11-16T22:31:08.01"` |

Review Comment:
   In that case, I'll require trailing 0s.
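
To illustrate the "Exact Numeric" row with trailing zeros required, here is a minimal Python sketch. It assumes the decimal is available as an unscaled integer plus a non-negative scale; the function name `exact_numeric_to_json` is hypothetical and not part of the spec or of any Parquet library.

```python
def exact_numeric_to_json(unscaled: int, scale: int) -> str:
    """Render unscaled * 10**-scale as a JSON number.

    The fraction always has exactly `scale` digits (trailing zeros are kept)
    and no exponent is ever produced. Pure integer/string arithmetic is used
    so that wide decimals are not rounded.
    """
    if scale < 0:
        raise ValueError("scale must be non-negative")
    sign = "-" if unscaled < 0 else ""
    digits = str(abs(unscaled))
    if scale == 0:
        return sign + digits
    # Left-pad so at least one digit remains in front of the decimal point.
    digits = digits.rjust(scale + 1, "0")
    return f"{sign}{digits[:-scale]}.{digits[-scale:]}"
```

With this sketch, an unscaled value of 3400 at scale 2 renders as 34.00 rather than 34, which keeps the scale recoverable from the JSON text.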






Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2024-11-26 Thread via GitHub


rdblue commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1859071155


##
VariantShredding.md:
##

Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2024-11-26 Thread via GitHub


rdblue commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1859108674


##
VariantShredding.md:
##
@@ -25,290 +25,316 @@
 The Variant type is designed to store and process semi-structured data 
efficiently, even with heterogeneous values.
 Query engines encode each Variant value in a self-describing format, and store 
it as a group containing `value` and `metadata` binary fields in Parquet.
 Since data is often partially homogenous, it can be beneficial to extract 
certain fields into separate Parquet columns to further improve performance.
-We refer to this process as **shredding**.
-Each Parquet file remains fully self-describing, with no additional metadata 
required to read or fully reconstruct the Variant data from the file.
-Combining shredding with a binary residual provides the flexibility to 
represent complex, evolving data with an unbounded number of unique fields 
while limiting the size of file schemas, and retaining the performance benefits 
of a columnar format.
+This process is **shredding**.
 
-This document focuses on the shredding semantics, Parquet representation, 
implications for readers and writers, as well as the Variant reconstruction.
-For now, it does not discuss which fields to shred, user-facing API changes, 
or any engine-specific considerations like how to use shredded columns.
-The approach builds upon the [Variant Binary Encoding](VariantEncoding.md), 
and leverages the existing Parquet specification.
+Shredding enables the use of Parquet's columnar representation for more 
compact data encoding, column statistics for data skipping, and partial 
projections.
 
-At a high level, we replace the `value` field of the Variant Parquet group 
with one or more fields called `object`, `array`, `typed_value`, and 
`variant_value`.
-These represent a fixed schema suitable for constructing the full Variant 
value for each row.
+For example, the query `SELECT variant_get(event, '$.event_ts', 'timestamp') 
FROM tbl` only needs to load field `event_ts`, and if that column is shredded, 
it can be read by columnar projection without reading or deserializing the rest 
of the `event` Variant.
+Similarly, for the query `SELECT * FROM tbl WHERE variant_get(event, 
'$.event_type', 'string') = 'signup'`, the `event_type` shredded column 
metadata can be used for skipping and to lazily load the rest of the Variant.
 
-Shredding allows a query engine to reap the full benefits of Parquet's 
columnar representation, such as more compact data encoding, min/max statistics 
for data skipping, and I/O and CPU savings from pruning unnecessary fields not 
accessed by a query (including the non-shredded Variant binary data).
-Without shredding, any query that accesses a Variant column must fetch all 
bytes of the full binary buffer.
-With shredding, we can get nearly equivalent performance as in a relational 
(scalar) data model.
+## Variant Metadata
 
-For example, `select variant_get(variant_col, ‘$.field1.inner_field2’, 
‘string’) from tbl` only needs to access `inner_field2`, and the file scan 
could avoid fetching the rest of the Variant value if this field was shredded 
into a separate column in the Parquet schema.
-Similarly, for the query `select * from tbl where variant_get(variant_col, 
‘$.id’, ‘integer’) = 123`, the scan could first decode the shredded `id` 
column, and only fetch/decode the full Variant value for rows that pass the 
filter.
+Variant metadata is stored in the top-level Variant group in a binary 
`metadata` column regardless of whether the Variant value is shredded.
 
-# Parquet Example
+All `value` columns within the Variant must use the same `metadata`.
+All field names of a Variant, whether shredded or not, must be present in the 
metadata.
 
-Consider the following Parquet schema together with how Variant values might 
be mapped to it.
-Notice that we represent each shredded field in `object` as a group of two 
fields, `typed_value` and `variant_value`.
-We extract all homogenous data items of a certain path into `typed_value`, and 
set aside incompatible data items in `variant_value`.
-Intuitively, incompatibilities within the same path may occur because we store 
the shredding schema per Parquet file, and each file can contain several row 
groups.
-Selecting a type for each field that is acceptable for all rows would be 
impractical because it would require buffering the contents of an entire file 
before writing.
+## Value Shredding
 
-Typically, the expectation is that `variant_value` exists at every level as an 
option, along with one of `object`, `array` or `typed_value`.
-If the actual Variant value contains a type that does not match the provided 
schema, it is stored in `variant_value`.
-An `variant_value` may also be populated if an object can be partially 
represented: any fields that are present in the schema must be written to those 
fields, and any missing fields are written to `variant_value`.
-
-The

Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2024-11-26 Thread via GitHub


rdblue commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1859117065


##
VariantShredding.md:
##

Re: [PR] GH-465: Clarify backward-compatibility rules on LIST type [parquet-format]

2024-11-26 Thread via GitHub


wgtmac commented on code in PR #466:
URL: https://github.com/apache/parquet-format/pull/466#discussion_r1859989177


##
LogicalTypes.md:
##
@@ -684,44 +702,67 @@ optional group my_list (LIST) {
 }
 ```
 
-Some existing data does not include the inner element layer. For
-backward-compatibility, the type of elements in `LIST`-annotated structures
-should always be determined by the following rules:
+# 2-level structure
+
+Some existing data does not include the inner element layer, resulting in a
+`LIST` that annotates a 2-level structure. Unlike the 3-level structure, the
+repetition of a 2-level structure can be `optional`, `required`, or `repeated`.
+When it is `repeated`, the `LIST`-annotated 2-level structure can only serve as
+an element within another `LIST`-annotated 2-level structure.
+
+```
+<list-repetition> group <name> (LIST) {
+  repeated <element-type> <element-name>;
+}

Review Comment:
   Removed






Re: [PR] GH-465: Clarify backward-compatibility rules on LIST type [parquet-format]

2024-11-26 Thread via GitHub


mapleFU commented on PR #466:
URL: https://github.com/apache/parquet-format/pull/466#issuecomment-2502968117

   > The rules part is looking good, but I think that spending time documenting 
what people did incorrectly years ago makes the doc more confusing and 
increases chances that people will write invalid lists. I'd prefer to revert 
most of the changes that explain what people did incorrectly.
   
   I agree, but I think those can be posted in the pull-request description.





Re: [PR] GH-465: Clarify backward-compatibility rules on LIST type [parquet-format]

2024-11-26 Thread via GitHub


wgtmac commented on PR #466:
URL: https://github.com/apache/parquet-format/pull/466#issuecomment-2502982189

   @rdblue Thanks for your review! I have removed all unnecessary changes. 
Please take a look again. 





Re: [PR] GH-465: Clarify backward-compatibility rules on LIST type [parquet-format]

2024-11-26 Thread via GitHub


wgtmac commented on code in PR #466:
URL: https://github.com/apache/parquet-format/pull/466#discussion_r1859998523


##
LogicalTypes.md:
##
@@ -684,44 +689,58 @@ optional group my_list (LIST) {
 }
 ```
 
-Some existing data does not include the inner element layer. For
-backward-compatibility, the type of elements in `LIST`-annotated structures
+Some existing data does not include the inner element layer, resulting in a
+`LIST` that annotates a 2-level structure. Unlike the 3-level structure, the
+repetition of a 2-level structure can be `optional`, `required`, or `repeated`.
+When it is `repeated`, the `LIST`-annotated 2-level structure can only serve as
+an element within another `LIST`-annotated 2-level structure.
+
+For backward-compatibility, the type of elements in `LIST`-annotated structures
 should always be determined by the following rules:
 
 1. If the repeated field is not a group, then its type is the element type and
elements are required.
 2. If the repeated field is a group with multiple fields, then its type is the
element type and elements are required.
-3. If the repeated field is a group with one field and is named either `array`
+3. If the repeated field is a group with one field with `repeated` repetition,
+   then its type is the element type and elements are required.
+4. If the repeated field is a group with one field and is named either `array`
or uses the `LIST`-annotated group's name with `_tuple` appended then the
repeated type is the element type and elements are required.
-4. Otherwise, the repeated field's type is the element type with the repeated
+5. Otherwise, the repeated field's type is the element type with the repeated

Review Comment:
   I don't want to add an example for rule 5 because one already exists at line 685.






Re: [PR] GH-465: Clarify backward-compatibility rules on LIST type [parquet-format]

2024-11-26 Thread via GitHub


wgtmac commented on code in PR #466:
URL: https://github.com/apache/parquet-format/pull/466#discussion_r1859997738


##
LogicalTypes.md:
##
@@ -684,44 +689,58 @@ optional group my_list (LIST) {
 }
 ```
 
-Some existing data does not include the inner element layer. For
-backward-compatibility, the type of elements in `LIST`-annotated structures
+Some existing data does not include the inner element layer, resulting in a
+`LIST` that annotates a 2-level structure. Unlike the 3-level structure, the
+repetition of a 2-level structure can be `optional`, `required`, or `repeated`.
+When it is `repeated`, the `LIST`-annotated 2-level structure can only serve as
+an element within another `LIST`-annotated 2-level structure.
+
+For backward-compatibility, the type of elements in `LIST`-annotated structures
 should always be determined by the following rules:
 
 1. If the repeated field is not a group, then its type is the element type and
elements are required.
 2. If the repeated field is a group with multiple fields, then its type is the
element type and elements are required.
-3. If the repeated field is a group with one field and is named either `array`
+3. If the repeated field is a group with one field with `repeated` repetition,
+   then its type is the element type and elements are required.
+4. If the repeated field is a group with one field and is named either `array`
or uses the `LIST`-annotated group's name with `_tuple` appended then the
repeated type is the element type and elements are required.
-4. Otherwise, the repeated field's type is the element type with the repeated
+5. Otherwise, the repeated field's type is the element type with the repeated

Review Comment:
   I have reverted most of the previous changes and now it should be clear. 
@etseidl @mapleFU 
   
   To resolve a LIST-annotated group, we should apply the rules in order:
   - check if it is a 2-level structure (rules 1 to 3)
   - check if it is a special 2-level structure (rule 4)
   - it is a 3-level structure (rule 5)
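
The ordering above can be sketched in Python as follows. This is only an illustration under stated assumptions: the `Field` class is a hypothetical stand-in for a Parquet schema node, not a parquet-java or parquet-format API, and because rule 5 is cut off in the quoted diff, the sketch reads it as the standard 3-level structure, where the single child of the repeated group is the element and keeps its own repetition.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Field:
    """Hypothetical schema node; `type` is None for groups."""
    name: str
    repetition: str                      # "required", "optional" or "repeated"
    type: Optional[str] = None
    children: List["Field"] = field(default_factory=list)

    @property
    def is_group(self) -> bool:
        return self.type is None


def resolve_list_element(list_group: Field) -> Tuple[Field, bool]:
    """Return (element field, elements_required) for a LIST-annotated group."""
    if len(list_group.children) != 1 or list_group.children[0].repetition != "repeated":
        raise ValueError("a LIST group must contain exactly one repeated field")
    repeated = list_group.children[0]

    # Rule 1: a repeated primitive is itself the (required) element.
    if not repeated.is_group:
        return repeated, True
    if not repeated.children:
        raise ValueError("repeated group has no fields")
    # Rule 2: a repeated group with multiple fields is the element.
    if len(repeated.children) > 1:
        return repeated, True
    only_child = repeated.children[0]
    # Rule 3: a single child that is itself repeated means the group is the element.
    if only_child.repetition == "repeated":
        return repeated, True
    # Rule 4: legacy names `array` and `<list name>_tuple` mark a 2-level structure.
    if repeated.name in ("array", list_group.name + "_tuple"):
        return repeated, True
    # Rule 5 (assumed reading): 3-level structure, the child is the element.
    return only_child, only_child.repetition == "required"
```

For example, a group `my_list` whose repeated child is a primitive `int32` falls under rule 1, while a repeated group `list` wrapping an `optional` field `element` falls through to rule 5 and yields nullable elements.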






Re: [PR] GH-3070: Add Variant logical type annotation to parquet-java [parquet-java]

2024-11-26 Thread via GitHub


wgtmac commented on PR #3072:
URL: https://github.com/apache/parquet-java/pull/3072#issuecomment-2502503713

   I think it should be in one change. The parquet-format cannot be released 
without a concrete PoC implementation in parquet-java. Without that release, 
separate changes may break CI and thus cannot be merged.





Re: [PR] GH-465: Clarify backward-compatibility rules on LIST type [parquet-format]

2024-11-26 Thread via GitHub


wgtmac commented on code in PR #466:
URL: https://github.com/apache/parquet-format/pull/466#discussion_r1859970898


##
LogicalTypes.md:
##
@@ -609,9 +609,20 @@ that is neither contained by a `LIST`- or `MAP`-annotated group nor annotated
 by `LIST` or `MAP` should be interpreted as a required list of required
 elements where the element type is the type of the field.
 
-Implementations should use either `LIST` and `MAP` annotations _or_ unannotated
-repeated fields, but not both. When using the annotations, no unannotated
-repeated types are allowed.
+```
+// List<Integer> (non-null list, non-null elements)
+repeated int32 num;
+
+// List<Tuple<Integer, String>> (non-null list, non-null elements)
+repeated group my_list {
+  required int32 num;
+  optional binary str (STRING);
+}

Review Comment:
   That makes sense. Let me remove these examples first. I think a follow-up 
would be to deprecate it by moving it to the backward compatibility section 
and adding strong wording to discourage writers from emitting it.






Re: [PR] MINOR: Add `doap.rdf` file for release tracking [parquet-java]

2024-11-26 Thread via GitHub


Fokko closed pull request #3001: MINOR: Add `doap.rdf` file for release tracking
URL: https://github.com/apache/parquet-java/pull/3001





Re: [PR] GH-3070: Add Variant logical type annotation to parquet-java [parquet-java]

2024-11-26 Thread via GitHub


Fokko commented on PR #3072:
URL: https://github.com/apache/parquet-java/pull/3072#issuecomment-2500124168

   @aihuaxu I agree with @emkornfield that the `iceberg-java` implementation 
should be able to read and write the variant type.
   
   It would also be great to drop some example Parquet files in 
https://github.com/apache/parquet-testing; this would also help adoption by 
other implementations. See 
https://github.com/apache/parquet-format/pull/456#issuecomment-2479905612

