Thank you all for the supportive words and feedback! Condensed responses
inline below.

On Sat, May 11, 2024 at 9:27 AM Amogh Jahagirdar <am...@tabular.io> wrote:

> Thanks for raising this thread! Overall I agree that some forms of variant
> types would be useful in Iceberg and we've seen interest in support for
> that as well. Specifically, there has been discussion on adding an
> optimized JSON data type as part of the V3 spec
> <https://github.com/apache/iceberg/issues/9066>.
>

Thanks for the reference. That looks like an empty placeholder issue, and
searching the dev list I'm not finding any discussion of JSONB or BSON.
Just to be sure I'm not missing anything, are there any artifacts from
those discussions, or have they happened offline, with only the
placeholder capturing the idea for now?


> I have some questions on some of the points that were brought up:
>
> > One specific point to make here is that, since an Apache OSS version of
> variant encoding already exists in Spark, it likely makes sense to simply
> adopt the Spark encoding as the Iceberg standard as well. The encoding we
> use internally today in Snowflake is slightly different, but essentially
> equivalent, and we see no particular value in trying to clutter the space
> with another equivalent-but-incompatible encoding.
>
> 1.)  Isn't the variant encoding in Spark essentially a binary data type
> (i.e. an array of arbitrary bytes)? Iceberg already has a binary data
> type <https://iceberg.apache.org/spec/#primitive-types>defined as well. I
> would think that we would want to spec out something along the lines of a
> JSONB data type (how are keys organized in the structure, null values etc).
> It's worth looking at what systems like Postgres do.
> In general, I think we would want to choose an encoding scheme with good
> compression characteristics, balanced against decoding/reading
> performance. We do want to make engine integration as easy as possible
> (so that the common JSON functions engines expose can be supported
> efficiently)
>

Yes, any full-blown variant (or equivalent) definition will cover nesting,
nulls, etc. It does seem worth contrasting JSONB and BSON alongside the
Snowflake and Spark variant implementations; we can roll such an analysis
into our proposal.
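For anyone following along who hasn't looked inside these encodings, here is a deliberately simplified sketch of the general shape they share (a type tag followed by a typed payload, recursing for objects). The tags and layout below are invented for illustration and match none of the real encodings:

```python
import struct

# Toy tagged binary encoding, for illustration only -- these tag values
# and layouts are made up and are NOT the actual Spark variant encoding
# (or Snowflake's, or JSONB's, or BSON's).
TAG_NULL, TAG_INT, TAG_STR, TAG_OBJ = 0, 1, 2, 3

def encode(value):
    """Encode a Python value into a toy tagged binary variant form."""
    if value is None:
        return bytes([TAG_NULL])
    if isinstance(value, int):
        return bytes([TAG_INT]) + struct.pack("<q", value)
    if isinstance(value, str):
        data = value.encode("utf-8")
        return bytes([TAG_STR]) + struct.pack("<I", len(data)) + data
    if isinstance(value, dict):
        out = bytes([TAG_OBJ]) + struct.pack("<I", len(value))
        for key, child in value.items():
            kdata = key.encode("utf-8")
            out += struct.pack("<I", len(kdata)) + kdata + encode(child)
        return out
    raise TypeError(f"unsupported type: {type(value).__name__}")

blob = encode({"id": 7, "name": "iceberg", "extra": None})
```

The real designs differ mainly in how they lay out object keys (sorted dictionaries, offset tables, etc.) to enable cheap field lookup, which is exactly what the comparative analysis should cover.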


> 2.) I'm less familiar with subcolumnarization, but it sounds like a
> technique for pruning and efficient materialization of nested fields which
> is very interesting, but I think I'd probably try to get consensus on the
> specific data types we
> want to add, and spec those out first, with integration considerations
> (Spark, Trino, Python, etc). I think there's enough to unpack there that's
> worth a discussion, before jumping into more complex optimizations. Certain
> cases may be inefficient without this technique but I think it's quite
> useful to at least start out with the data type definitions and engine
> integrations. We could then possibly look at support for this technique
> afterwards. Wonder what others think though.
>

From my perspective, I think it's fine to treat them as somewhat separate,
as you're correct that subcolumnarization is essentially a performance
optimization that builds on the variant datatype feature. That said, I'm
not sure I would strictly sequence them, as there are a number of variant
workloads which are simply unusable without subcolumnarization. So assuming
we reach general consensus to head towards supporting *some* kind of
variant style datatype, it may be worthwhile getting started on hammering
out the subcolumnarization details before the variant stuff has fully
landed. The variant proposal will certainly come first, however, and we can
decide as we go when it makes sense to start discussing subcolumnarization.
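As a rough illustration of what subcolumnarization means in practice (the path names, data, and shredding strategy here are all hypothetical), the idea is to pull frequently queried paths out of the variant blob into their own typed columns, leaving a residual variant for everything else:

```python
import json

# Toy illustration of subcolumnarization ("shredding"): frequently
# queried paths in a variant column are pulled out into their own typed
# columns, so engines can prune and materialize them without parsing
# each row's full variant blob. The paths and data here are made up.
rows = [
    {"event": "click", "user": {"id": 1}},
    {"event": "view", "user": {"id": 2}, "extra": {"ab": "x"}},
]

def shred(records, paths):
    """Split records into per-path subcolumns plus a residual variant."""
    columns = {p: [] for p in paths}
    residual = []
    for rec in records:
        remaining = json.loads(json.dumps(rec))  # deep copy via JSON
        for path in paths:
            node, keys = remaining, path.split(".")
            for key in keys[:-1]:
                node = node.get(key, {}) if isinstance(node, dict) else {}
            value = node.pop(keys[-1], None) if isinstance(node, dict) else None
            columns[path].append(value)
        residual.append(remaining)
    return columns, residual

cols, leftover = shred(rows, ["event", "user.id"])
print(cols["event"])    # ['click', 'view']
print(cols["user.id"])  # [1, 2]
```

The workloads I mentioned as "simply unusable" are the ones where a query touches one or two such paths across billions of rows; without shredding, every row's full blob must be decoded.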


> > so our plan is to write something up in that vein that covers the
> proposed spec changes, backwards compatibility, implementor burdens, etc.
> 3.) Minor comment, since I'm assuming this is considered as
> "implementation burdens" but I think a proposal should also ideally cover
> any relevant details for integrations with Spark, Trino, PyIceberg, etc. At
> least to make sure the community is aware of any challenges in integration
> (if there's any), so we can make progress in those areas and get adoption
> for these new data types!
>

Yes, our intent with that statement was to try to spec out the major
integrations. We may need some guidance on just how many we need to look
at; we were planning on Spark and Trino, but weren't sure how much further
down the rabbit hole we needed to go. But I think we can sort that out as
we work through the proposals.


> On Fri, May 10, 2024 at 11:28 PM Gang Wu <ust...@gmail.com> wrote:
>
>> Hi,
>>
>> This sounds very interesting!
>>
>> IIUC, the current variant type in Apache Spark stores data in the
>> BINARY type. When it comes to subcolumnarization, does it require the file
>> format (e.g. Apache Parquet/ORC/Avro) to support variant type natively?
>>
>
Not necessarily, no. As long as the format has a binary type, and Iceberg
and the query engines are aware that the binary column needs to be
interpreted as a variant, that should be sufficient.
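A minimal sketch of that arrangement, with a made-up column-property key standing in for whatever mechanism Iceberg would actually adopt, and JSON-in-UTF-8 standing in for a real variant encoding:

```python
import json

# Sketch of why native file-format support isn't strictly required: the
# variant is physically stored as a plain binary column, and a column
# property tells readers to interpret the bytes as an encoded variant.
# The "logical-type" property key is hypothetical, and JSON-in-UTF-8
# stands in for a real variant encoding.
column_meta = {
    "name": "payload",
    "type": "binary",
    "properties": {"logical-type": "variant"},  # hypothetical marker
}

def read_cell(raw: bytes, meta: dict):
    """Decode a cell if its column is marked as a variant, else pass through."""
    if meta.get("properties", {}).get("logical-type") == "variant":
        return json.loads(raw.decode("utf-8"))  # engine-side variant decode
    return raw  # otherwise, opaque bytes

cell = read_cell(b'{"a": 1}', column_meta)
print(cell)  # {'a': 1}
```

Native Parquet/ORC/Avro support could make things faster or more ergonomic later, but the metadata-over-binary approach keeps the file formats out of the critical path for the initial spec.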

On Sat, May 11, 2024 at 4:45 PM Micah Kornfield <emkornfi...@gmail.com>
wrote:

> Hi Tyler, et al.,
> I think some sort of semi-structured type is a good idea.  I think one
> important question is whether to support Variant, JSON or another
> representation of semi-structured data as the user facing data type.
>
> Please correct me if I'm wrong, but I think Variant is mostly a superset
> of JSON where scalar values have a richer type system (e.g. different byte
> widths for ints and logical types like timestamp)?
>

Correct, Variant is a superset of JSON. It looks like BSON is a superset as
well. As mentioned above, we'll include an analysis of JSONB and BSON in
our proposal so we can discuss the merits of extending something in the
JSON lineage vs going with something new.
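To make the "richer type system" point from Micah's question concrete, here is a small sketch; the tag name is illustrative, not from any real variant spec:

```python
import json
from datetime import datetime, timezone

# Plain JSON has a single number type and no timestamp type, so typed
# scalars must be stringified and the type is lost on round-trip.
ts = datetime(2024, 5, 11, 9, 27, tzinfo=timezone.utc)
via_json = json.loads(json.dumps({"ts": ts.isoformat()}))
assert isinstance(via_json["ts"], str)  # the timestamp type is gone

# A variant-style encoding tags the scalar with a physical type instead
# ("timestamp_micros" is an illustrative tag, not from any real spec),
# so readers can recover a genuine timestamp value:
typed = {"tag": "timestamp_micros", "value": int(ts.timestamp() * 1_000_000)}
recovered = datetime.fromtimestamp(typed["value"] / 1_000_000, tz=timezone.utc)
```

That type fidelity is the main thing variant adds over a JSON-shaped type, and it's a big part of why we want the comparative analysis before picking an encoding.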


> Also, I think JSON has been standardized as a type in the SQL
> specification but Variant types are still mostly vendor specific?
>

JSON, yes, but not JSONB IIUC. And yes, variants are currently mostly
vendor specific.

-Tyler
