cashmand commented on code in PR #520:
URL: https://github.com/apache/parquet-format/pull/520#discussion_r2301937925


##########
VariantShredding.md:
##########
@@ -270,9 +283,9 @@ optional group event (VARIANT) {
 
 # Data Skipping
 
-Statistics for `typed_value` columns can be used for file, row group, or page 
skipping when `value` is always null (missing).
+Statistics for `typed_value` columns can be used for file, row group, or page 
skipping when `value` is always NULL (missing).

Review Comment:
   I think Iceberg's current implementation also skips if all of the non-NULL 
values in `value` are null (i.e. `00`), which can be determined from column 
stats. I'm not exactly sure what the intent of the data skipping rules here 
are; it seems like they require semantics for casting and comparing a Variant 
to a scalar, which the next few sentence seem to imply are not being defined 
here.



##########
VariantShredding.md:
##########
@@ -142,26 +155,25 @@ optional group tags (VARIANT) {
 }
 ```
 
-All elements of an array must be present (not missing) because the `array` 
Variant encoding does not allow missing elements.
-That is, either `typed_value` or `value` (but not both) must be non-null.
-Null elements must be encoded in `value` as Variant null: basic type 0 
(primitive) and physical type 0 (null).
+All elements of a variant array must be present (not missing) because the 
`array` Variant arrays cannot encode missing (NULL) elements.
+That is, at least one of `typed_value` or `value` must be present (possibly 
both, if the elements are partially shredded objects).
 
 The series of `tags` arrays `["comedy", "drama"], ["horror", null], ["comedy", 
"drama", "romance"], null` would be stored as:
 
-| Array                            | `value`     | `typed_value `| 
`typed_value...value` | `typed_value...typed_value`    |
+| Array                            | `value`     | `typed_value `| 
`value...value`       | `typed_value...typed_value`    |
 
|----------------------------------|-------------|---------------|-----------------------|--------------------------------|
-| `["comedy", "drama"]`            | null        | non-null      | [null, 
null]          | [`comedy`, `drama`]            |
-| `["horror", null]`               | null        | non-null      | [null, 
`00`]          | [`horror`, null]               |
-| `["comedy", "drama", "romance"]` | null        | non-null      | [null, 
null, null]    | [`comedy`, `drama`, `romance`] |
-| null                             | `00` (null) | null          |             
          |                                |
+| `["comedy", "drama"]`            | NULL        | non-NULL      | [NULL, 
NULL]          | [`comedy`, `drama`]            |
+| `["horror", null]`               | NULL        | non-NULL      | [NULL, 
`00`]          | [`horror`, NULL]            |
+| `["comedy", "drama", "romance"]` | NULL        | non-NULL      | [NULL, 
NULL, NULL]    | [`comedy`, `drama`, `romance`] |
+| NULL                             | `00` (null) | NULL          |             
          |                                |
 
 #### Objects
 
 Fields of an object can be shredded using a Parquet group for `typed_value` 
that contains shredded fields.
 
-If the value is an object, `typed_value` must be non-null.
-If the value is not an object, `typed_value` must be null.
-Readers can assume that a value is not an object if `typed_value` is null and 
that `typed_value` field values are correct; that is, readers do not need to 
read the `value` column if `typed_value` fields satisfy the required fields.
+If the value is an object, `typed_value` must be present.
+If the value is not an object, `typed_value` must be missing.

Review Comment:
   I think `NULL` is clearer than `missing` here, since it's discussing the 
Parquet encoding.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to