Hi Nick,

Your understanding is correct. The null in the Variant spec is meant to encode a JSON null. A row-level value can be SQL null as in any nullable column, but within a Variant value, there is only the Variant-encoded null (i.e. JSON null). Some of the Spark expressions (e.g. cast to a non-Variant type) implicitly convert Variant null to SQL null.
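To make the distinction above concrete, here is a minimal sketch assuming the Variant binary layout used by Spark, where a value's header byte packs type_info into the upper 6 bits and basic_type into the lower 2 bits (0 = primitive), and primitive type_info 0 means null:

```python
def encode_variant_null() -> bytes:
    """Encode a Variant (JSON) null: a one-byte primitive value."""
    basic_type = 0  # primitive
    type_info = 0   # null
    return bytes([(type_info << 2) | basic_type])  # -> b"\x00"

# A row-level SQL null, by contrast, has no Variant bytes at all:
# the column slot itself is null, as for any nullable column.
rows = [
    encode_variant_null(),  # Variant-encoded JSON null
    None,                   # SQL null: nothing is encoded
]

for r in rows:
    print("variant null" if r == b"\x00" else "sql null")
```

The point is that the two nulls live at different levels: SQL null is a property of the row slot, while Variant null is an ordinary encoded value inside the Variant bytes.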
In the current version of the shredding spec, the intent is for null in a shredded column to represent a missing field. Variant null would need to be encoded in the "untyped_value" column. If JSON null is expected to be common, it might be reasonable to specify a mechanism to shred it (e.g. as a boolean column with a unique name).

Thanks,
David

On Wed, Aug 14, 2024 at 11:38 AM Nick Riasanovsky <n...@bodo.ai> wrote:
>
> Hello everyone,
>
> As it seems the Variant spec decisions are nearly finalized, I would like to
> ask a clarifying question regarding the difference between SQL Null (missing)
> and JSON Null. Reading through the Spark specification, source code, and also
> experimenting with Spark locally, it seems that the Variant type handles the
> difference between SQL Null and JSON Null at a row level and can successfully
> maintain this difference. However, it seems to me that it's never possible
> for the contents of a Variant value to contain a SQL Null value (only a JSON
> null), such as array(1, missing, 2). Since a Variant value is recursive,
> there doesn't appear to be any way to encode a SQL NULL in the actual Variant
> value.
>
> If anyone has any insights that can confirm or reject my understanding, I'd
> greatly appreciate it. I'm trying to become more familiar with the Variant
> encoding, and this seemed like it could be a potential "gotcha" once column
> shredding is supported.
>
> Thanks,
> Nick Riasanovsky
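The shredding convention David describes can be illustrated with a hypothetical sketch. This is not the spec's actual schema or API, just an illustration of the convention: for a shredded field, null in the typed column means the field was missing, while a Variant-encoded null (represented here by the one-byte value b"\x00") falls back to the "untyped_value" column.

```python
VARIANT_NULL = b"\x00"  # one-byte encoded Variant null (illustrative)

def shred_field(obj: dict, key: str):
    """Return (typed_value, untyped_value) for one shredded int field."""
    if key not in obj:
        return (None, None)          # missing field: both columns are null
    value = obj[key]
    if value is None:
        return (None, VARIANT_NULL)  # JSON null: encoded in untyped_value
    if isinstance(value, int):
        return (value, None)         # value matches the shredded type
    return (None, value)             # type mismatch: falls back to untyped

print(shred_field({"a": 1}, "a"))     # (1, None)
print(shred_field({}, "a"))           # (None, None)
print(shred_field({"a": None}, "a"))  # (None, b'\x00')
```

The third case shows why the two nulls stay distinguishable after shredding: a missing field leaves both columns null, while a JSON null leaves a non-null untyped_value.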