scovich commented on issue #519: URL: https://github.com/apache/parquet-format/issues/519#issuecomment-3397769385
> I did some thinking on this, and I think I understand the distinction between the two null types. Feel free to correct me if I'm wrong however. > > `VARIANT_NULL` only exists because of the need to differentiate between a NULL value for a field, and the field value being missing, in the case of shredding. This is explained in the spec, [here](https://github.com/apache/parquet-format/blob/master/VariantShredding.md#objects) > > > A field's value and typed_value are set to null (missing) to indicate that the field does not exist in the variant. To encode a field that is present with a null value, the value must contain a Variant null: basic type 0 (primitive) and physical type 0 (null). > > I don't think there should be a distinction between `SQL NULL` and `VARIANT NULL` in the emitted value when the Variant column is scanned. The distinction should only exist at the storage level. From what I understand, variant null is equivalent to the JSON `null`, which is [one of the seven official JSON values](https://www.json.org/json-en.html): > A _value_ can be a _string_ in double quotes, or a _number_, or `true` or `false` or `null`, or an _object_ or an _array_. > <img src="https://www.json.org/img/value.png" width=500px> The `null` value is very much present. It is neither _missing_ (a state the variant spec recognizes) nor SQL NULL/UNKNOWN (a state the variant spec actually says nothing about). Put another way -- although casting JSON/variant `null` to any other type produces a SQL NULL value, it is still a well-defined value and thus cannot be SQL NULL. For example, in spark the expression `v IS NULL` returns false for rows where `v` is a variant null, and true for rows where `v` is missing, but `v::INT IS NULL` returns true for both cases (because casting `null` to int produces NULL). There's a separate `is_variant_null` function to distinguish the two situations. The specific bit of spec you quoted tells readers what to do if they encounter physically valid parquet that encodes a logically _invalid_ flavor of shredded variant storage because a `value` column (or column value) is wrongly missing. For example, suppose that variant column `v` is shredded as an array with struct elements, where each struct contains a single field `f` of type int32: ``` optional group v (VARIANT) { required binary metadata; optional binary value; optional group typed_value (LIST) { # must be optional to allow a null list repeated group list { required group element { # shredded element of the variant array optional binary value; # a variant, expected to be an object optional group typed_value { # shredded fields for the variant object required group f { # shredded field for f optional binary value; optional int32 typed_value; } } } } } } ``` With the above shredding schema: 1. It is _illegal_ for both `v.value` and `v.typed_value` to be missing -- a NULL value of `v` must be encoded as missing `group v`. 2. It is _illegal_ for `v.typed_value.list.element` to be missing -- variant arrays cannot encode missing values 3. It is _illegal_ for both `v.typed_value.list.element.value` and `v.typed_value.list.element.typed_value` to be missing -- variant arrays cannot encode missing values 4. It is _illegal_ for `v.typed_value.list.element.typed_value.f` to be missing. Instead, a missing field `f` is inferred when both `v.typed_value.list.element.typed_value.f.value` and `v.typed_value.list.element.typed_value.f.typed_value` are missing. For all four illegal cases above, the shredding spec requires a reader to compensate by replacing the wrongly-missing value with a variant null value -- not missing or SQL NULL. It's not clear to me whether the spec forbids a reader to error out in these cases, or if it merely constrains the behavior of a reader who chooses not to error out? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
