scovich commented on issue #519:
URL: https://github.com/apache/parquet-format/issues/519#issuecomment-3397769385

   > I did some thinking on this, and I think I understand the distinction 
between the two null types. Feel free to correct me if I'm wrong however.
   > 
   > `VARIANT_NULL` only exists because of the need to differentiate between a 
NULL value for a field, and the field value being missing, in the case of 
shredding. This is explained in the spec, 
[here](https://github.com/apache/parquet-format/blob/master/VariantShredding.md#objects)
   > 
   > > A field's value and typed_value are set to null (missing) to indicate 
that the field does not exist in the variant. To encode a field that is present 
with a null value, the value must contain a Variant null: basic type 0 
(primitive) and physical type 0 (null).
   > 
   > I don't think there should be a distinction between `SQL NULL` and 
`VARIANT NULL` in the emitted value when the Variant column is scanned. The 
distinction should only exist at the storage level.
   
   From what I understand, variant null is equivalent to the JSON `null`, which 
is [one of the seven official JSON values](https://www.json.org/json-en.html):
   > A _value_ can be a _string_ in double quotes, or a _number_, or `true` or 
`false` or `null`, or an _object_ or an _array_.
   > <img src="https://www.json.org/img/value.png"; width=500px>
   
   The `null` value is very much present. It is neither _missing_ (a state the 
variant spec recognizes) nor SQL NULL/UNKNOWN (a state the variant spec 
actually says nothing about). 
   
   Put another way -- although casting JSON/variant `null` to any other type 
produces a SQL NULL value, it is still a well-defined value and thus cannot be 
SQL NULL. For example, in spark the expression `v IS NULL` returns false for 
rows where `v` is a variant null, and true for rows where `v` is missing, but 
`v::INT IS NULL` returns true for both cases (because casting `null` to int 
produces NULL). There's a separate `is_variant_null` function to distinguish 
the two situations.
   
   The specific bit of spec you quoted tells readers what to do if they 
encounter physically valid parquet that encodes a logically _invalid_ flavor of 
shredded variant storage because a `value` column (or column value) is wrongly 
missing. 
   
   For example, suppose that variant column `v` is shredded as an array with 
struct elements, where each struct contains a single field `f` of type int32:
   
   ```
   optional group v (VARIANT) {
     required binary metadata;
     optional binary value;
     optional group typed_value (LIST) {   # must be optional to allow a null 
list
       repeated group list {
         required group element {          # shredded element of the variant 
array
           optional binary value;            # a variant, expected to be an 
object
           optional group typed_value {      # shredded fields for the variant 
object
             required group f {              # shredded field for f
               optional binary value;
               optional int32 typed_value;
             }
           }
         }
       }
     }
   }    
   ```
   With the above shredding schema:
   1. It is _illegal_ for both `v.value` and `v.typed_value` to be missing -- a 
NULL value of `v` must be encoded as missing `group v`. 
   2. It is _illegal_ for `v.typed_value.list.element` to be missing -- variant 
arrays cannot encode missing values
   3. It is _illegal_ for both `v.typed_value.list.element.value` and 
`v.typed_value.list.element.typed_value` to be missing -- variant arrays cannot 
encode missing values
   4. It is _illegal_ for `v.typed_value.list.element.typed_value.f` to be 
missing. Instead, a missing field `f` is inferred when both 
`v.typed_value.list.element.typed_value.f.value` and 
`v.typed_value.list.element.typed_value.f.typed_value` are missing.
   
   For all four illegal cases above, the shredding spec requires a reader to 
compensate by replacing the wrongly-missing value with a variant null value -- 
not missing or SQL NULL. 
   
   It's not clear to me whether the spec forbids a reader to error out in these 
cases, or if it merely constrains the behavior of a reader who chooses not to 
error out?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to