To follow up on the idea of multiple physical types for a shredded column, we had a discussion internally about this and I think it's pretty reasonable to add that later if we end up needing it. I agree that there's no pressing need to add that complication to the spec.
On Wed, Aug 14, 2024 at 2:41 PM David Cashman <david.cash...@databricks.com> wrote:
> Hi Selcuk, that's a good point. I don't think the spec discusses
> how a null in an array should be interpreted. I had assumed that it
> would be an invalid state (and probably should have said so
> explicitly), but I agree that we could specify that it should be
> interpreted as a JSON null.
>
> Thanks,
>
> David
>
> On Wed, Aug 14, 2024 at 5:11 PM Selcuk Aya
> <selcuk....@snowflake.com.invalid> wrote:
> >
> > Hi David, just to clarify, I think we can shred arrays with JSON nulls
> > without having to use the untyped_value column, is this correct?
> >
> > Selcuk
> >
> > On Wed, Aug 14, 2024 at 11:31 PM David Cashman <
> > david.cash...@databricks.com.invalid> wrote:
> >>
> >> Hi Nick,
> >>
> >> Your understanding is correct. The null in the Variant spec is meant
> >> to encode a JSON null. A row-level value can be SQL null, as in any
> >> nullable column, but within a Variant value there is only the
> >> Variant-encoded null (i.e. JSON null). Some of the Spark expressions
> >> (e.g. cast to a non-Variant type) implicitly convert Variant null to
> >> SQL null.
> >>
> >> In the current version of the shredding spec, the intent is for null
> >> in a shredded column to represent a missing field. Variant null would
> >> need to be encoded in the "untyped_value" column. If JSON null is
> >> expected to be common, it might be reasonable to specify a mechanism
> >> to shred it (e.g. as a boolean column with a unique name).
> >>
> >> Thanks,
> >>
> >> David
> >>
> >> On Wed, Aug 14, 2024 at 11:38 AM Nick Riasanovsky <n...@bodo.ai> wrote:
> >> >
> >> > Hello everyone,
> >> >
> >> > As it seems the Variant spec decisions are nearly finalized, I would
> >> > like to ask a clarifying question regarding the difference between
> >> > SQL Null (missing) and JSON Null.
> >> > Reading through the Spark specification, source code, and also
> >> > experimenting with Spark locally, it seems that the Variant type
> >> > handles the difference between SQL Null and JSON Null at a row
> >> > level and can successfully maintain this difference. However, it
> >> > seems to me that it's never possible for the contents of a Variant
> >> > value to contain a SQL Null value (only a JSON Null), such as
> >> > array(1, missing, 2). Since a Variant value is recursive, there
> >> > doesn't appear to be any way to encode a SQL NULL in the actual
> >> > Variant value.
> >> >
> >> > If anyone has any insights that can confirm or reject my
> >> > understanding, I'd greatly appreciate it. I'm trying to become more
> >> > familiar with the Variant encoding, and this seemed like it could
> >> > be a potential "gotcha" once column shredding is supported.
> >> >
> >> > Thanks,
> >> > Nick Riasanovsky

-- 
Ryan Blue
Databricks
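For readers following the thread, the three-way distinction David describes (missing field vs. JSON null vs. a typed value in a shredded column) can be sketched as follows. This is a minimal conceptual sketch in Python, not the actual Parquet shredding layout or binary encoding; the JSON_NULL sentinel, the placeholder "variant-null-bytes" string, and the helper function names are all invented for illustration.

```python
# Conceptual sketch of the shredding behavior discussed above (assumed
# semantics, not the real Parquet format): a SQL-null/missing field leaves
# both shredded columns null, while a Variant-encoded (JSON) null must be
# carried in the "untyped_value" column.

JSON_NULL = object()  # sentinel standing in for a Variant-encoded JSON null


def shred_field(value):
    """Shred one field into (typed_value, untyped_value) columns.

    - None (missing field / SQL null)  -> both columns null
    - JSON_NULL (Variant null)         -> encoded in untyped_value
    - any other value                  -> stored in typed_value
    """
    if value is None:
        return {"typed_value": None, "untyped_value": None}
    if value is JSON_NULL:
        # Placeholder for the Variant binary encoding of a JSON null.
        return {"typed_value": None, "untyped_value": "variant-null-bytes"}
    return {"typed_value": value, "untyped_value": None}


def unshred_field(cols):
    """Reverse the mapping: missing and JSON null stay distinguishable."""
    if cols["untyped_value"] is not None:
        return JSON_NULL
    if cols["typed_value"] is None:
        return None  # field was missing
    return cols["typed_value"]
```

The point of the sketch is that the round trip is lossless: a missing field and a JSON null come back as different values, which is exactly the property Nick's question probes.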