cashmand commented on code in PR #520:
URL: https://github.com/apache/parquet-format/pull/520#discussion_r2298973586
##########
VariantShredding.md:
##########
@@ -110,18 +119,19 @@ Shredded values must use the following Parquet types:
Primitive values can be shredded using the equivalent Parquet primitive type
from the table above for `typed_value`.
-Unless the value is shredded as an object (see [Objects](#objects)),
`typed_value` or `value` (but not both) must be non-null.
+Unless the value is shredded as an object (see [Objects](#objects)), exactly
one of `typed_value` or `value` must be present in each row.
+Missing values in an optional group are encoded as missing values of the
entire group.
#### Arrays
Arrays can be shredded by using a 3-level Parquet list for `typed_value`.
-If the value is not an array, `typed_value` must be null.
-If the value is an array, `value` must be null.
+If the value is not an array, `typed_value` must be missing.
Review Comment:
This may be a matter of preference, so I'm fine with being overruled, but my
preference is to use `missing` to refer to a value that is absent from the
logical Variant value. E.g. in `{"a": 1, "c": 3}`, `b` is missing (even if
present in the shredding spec). But when talking about encoding in Parquet, I
think it makes more sense to use NULL, which I think is more consistent with
the rest of the Parquet spec.
##########
VariantShredding.md:
##########
@@ -110,18 +119,19 @@ Shredded values must use the following Parquet types:
Primitive values can be shredded using the equivalent Parquet primitive type
from the table above for `typed_value`.
-Unless the value is shredded as an object (see [Objects](#objects)),
`typed_value` or `value` (but not both) must be non-null.
+Unless the value is shredded as an object (see [Objects](#objects)), exactly
one of `typed_value` or `value` must be present in each row.
+Missing values in an optional group are encoded as missing values of the
entire group.
#### Arrays
Arrays can be shredded by using a 3-level Parquet list for `typed_value`.
-If the value is not an array, `typed_value` must be null.
-If the value is an array, `value` must be null.
+If the value is not an array, `typed_value` must be missing.
+If the value is an array, `value` must be missing.
The list `element` must be a required group.
The `element` group can contain `value` and `typed_value` fields.
-The element's `value` field stores the element as Variant-encoded `binary`
when the `typed_value` is not present or cannot represent it.
+The element's `value` field stores the element as Variant-encoded `binary`
when the element is missing or `typed_value` cannot represent it.
Review Comment:
I'm not sure what `the element is missing` means here. I think the intent of
the original wording was that if `typed_value` is not present in the shredding
schema, then `value` is effectively required (although that's arguably a
degenerate case of `typed_value` cannot represent it).
##########
VariantShredding.md:
##########
@@ -142,26 +152,26 @@ optional group tags (VARIANT) {
}
```
-All elements of an array must be present (not missing) because the `array`
Variant encoding does not allow missing elements.
-That is, either `typed_value` or `value` (but not both) must be non-null.
-Null elements must be encoded in `value` as Variant null: basic type 0
(primitive) and physical type 0 (null).
+All elements of a variant array must be present (not missing) because the
`array` Variant encoding does not allow missing elements.
+That is, either `typed_value` or `value` (but not both) must be present.
Review Comment:
Nit (not on your change): I think `(but not both)` is wrong and can be
removed. If the shredding schema is array-of-object, then the array elements
can have both `value` and `typed_value` present, where `value` contains the
object fields that do not exist in the array-of-object shredding schema.
##########
VariantShredding.md:
##########
@@ -110,18 +119,19 @@ Shredded values must use the following Parquet types:
Primitive values can be shredded using the equivalent Parquet primitive type
from the table above for `typed_value`.
-Unless the value is shredded as an object (see [Objects](#objects)),
`typed_value` or `value` (but not both) must be non-null.
+Unless the value is shredded as an object (see [Objects](#objects)), exactly
one of `typed_value` or `value` must be present in each row.
+Missing values in an optional group are encoded as missing values of the
entire group.
Review Comment:
I find this sentence a bit hard to make sense of. I think this sentence is
really a comment on the parquet-level logical Variant value, where I think NULL
instead of missing would be more consistent with terminology used in the rest
of parquet. Maybe we could use an example with a full parquet schema.
##########
VariantShredding.md:
##########
@@ -142,26 +152,26 @@ optional group tags (VARIANT) {
}
```
-All elements of an array must be present (not missing) because the `array`
Variant encoding does not allow missing elements.
-That is, either `typed_value` or `value` (but not both) must be non-null.
-Null elements must be encoded in `value` as Variant null: basic type 0
(primitive) and physical type 0 (null).
+All elements of a variant array must be present (not missing) because the
`array` Variant encoding does not allow missing elements.
+That is, either `typed_value` or `value` (but not both) must be present.
+When converting an array with (SQL) nullable elements to variant, missing (SQL
NULL) elements must be encoded in `value` as Variant null (`00`).
Review Comment:
Similar to my comment on object fields: I think the purpose of this
`VariantShredding.md` doc is to provide a spec for translating a Variant binary
(a `metadata`, `value` pair) into a shredded representation in parquet. In that
context, it doesn't really make sense to talk about `an array with (SQL)
nullable elements`. The array must already be a Variant array, and the Variant
binary spec has no way to represent SQL NULL as an array element, only Variant
`null`.
##########
VariantShredding.md:
##########
@@ -192,31 +202,31 @@ optional group event (VARIANT) {
The group for each named field must use repetition level `required`.
-A field's `value` and `typed_value` are set to null (missing) to indicate that
the field does not exist in the variant.
-To encode a field that is present with a null value, the `value` must contain
a Variant null: basic type 0 (primitive) and physical type 0 (null).
+A field's `value` and `typed_value` are both missing to indicate that the
field does not exist in the variant.
+To encode a field that is present with a SQL NULL value, the `value` must
contain a Variant null: basic type 0 (primitive) and physical type 0 (null).
Review Comment:
I think the intent of this section is about converting a Variant binary to
shredded Variant, so it should be `Variant null` in both parts of the sentence,
not `SQL NULL`.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]