Hi Nick,

Your understanding is correct. The null in the Variant spec is meant
to encode a JSON null. A row-level value can be SQL null as in any
nullable column, but within a Variant value, there is only the
Variant-encoded null (i.e. JSON null). Some of the Spark expressions
(e.g. cast to a non-Variant type) implicitly convert Variant null to
SQL null.

In the current version of the shredding spec, the intent is for null
in a shredded column to represent a missing field. Variant null would
need to be encoded in the "untyped_value" column. If JSON null is
expected to be common, it might be reasonable to specify a mechanism
to shred it (e.g. as a boolean column with a unique name).

Thanks,

David

On Wed, Aug 14, 2024 at 11:38 AM Nick Riasanovsky <n...@bodo.ai> wrote:
>
> Hello everyone,
>
> As it seems the Variant spec decisions are nearly finalized, I would like to 
> ask a clarifying question regarding the difference between SQL Null (missing) 
> and JSON Null. Reading through the Spark specification, source code, and also 
> experimenting with Spark locally, it seems that the Variant type handles the 
> difference between SQL Null and JSON Null at a row level and can successfully 
> maintain this difference. However, it seems to me that it's never possible 
> for contents of a variant value to contain a SQL Null value (only a JSON 
> NULL), such as array(1, missing, 2). Since a variant value is recursive, 
> there doesn't appear to be any way to encode a SQL NULL in the actual Variant 
> value.
>
> If anyone has any insights that can confirm or reject my understanding, I'd 
> greatly appreciate it. I'm trying to become more familiar with the Variant 
> encoded and this seemed like it could be a potential "gotcha" once column 
> shredding is supported.
>
> Thanks,
> Nick Riasanovsky

Reply via email to