Sounds like a good idea. Looking forward to a proposal.

Yufei
On Sat, May 11, 2024 at 9:27 AM Amogh Jahagirdar <am...@tabular.io> wrote:

> Hi all,
>
> Thanks for raising this thread! Overall I agree that some form of variant type would be useful in Iceberg, and we've seen interest in support for that as well. Specifically, there has been discussion on adding an optimized JSON data type as part of the V3 spec <https://github.com/apache/iceberg/issues/9066>. I have some questions on some of the points that were brought up:
>
> > One specific point to make here is that, since an Apache OSS version of variant encoding already exists in Spark, it likely makes sense to simply adopt the Spark encoding as the Iceberg standard as well. The encoding we use internally today in Snowflake is slightly different, but essentially equivalent, and we see no particular value in trying to clutter the space with another equivalent-but-incompatible encoding.
>
> 1.) Isn't the variant encoding in Spark essentially a binary data type (i.e., an array of arbitrary bytes)? Iceberg already has a binary data type <https://iceberg.apache.org/spec/#primitive-types> defined as well. I would think we would want to spec out something along the lines of a JSONB data type (how keys are organized in the structure, how null values are handled, etc.). It's worth looking at what systems like Postgres do. In general, I think we would want to choose an encoding scheme that balances good compression characteristics against decoding/reading performance. We also want to make engine integration as easy as possible (so that the common JSON functions engines expose can be implemented efficiently on top of it).
>
> 2.) I'm less familiar with subcolumnarization, but it sounds like a technique for pruning and efficient materialization of nested fields, which is very interesting. That said, I'd probably try to get consensus on the specific data types we want to add and spec those out first, along with integration considerations (Spark, Trino, Python, etc.). I think there's enough to unpack there that it's worth a discussion before jumping into more complex optimizations. Certain cases may be inefficient without this technique, but I think it's quite useful to at least start with the data type definitions and engine integrations; we could then look at support for this technique afterwards. I wonder what others think, though.
>
> > so our plan is to write something up in that vein that covers the proposed spec changes, backwards compatibility, implementor burdens, etc.
>
> 3.) Minor comment, since I'm assuming this falls under "implementor burdens": I think a proposal should also ideally cover any relevant details for integrations with Spark, Trino, PyIceberg, etc., at least to make sure the community is aware of any challenges in integration (if any), so we can make progress in those areas and get adoption for these new data types!
>
> Overall this is very exciting!
>
> Thanks,
>
> Amogh Jahagirdar
>
> On Fri, May 10, 2024 at 11:28 PM Gang Wu <ust...@gmail.com> wrote:
>
>> Hi,
>>
>> This sounds very interesting!
>>
>> IIUC, the current variant type in Apache Spark stores data in the BINARY type. When it comes to subcolumnarization, does it require the file format (e.g., Apache Parquet/ORC/Avro) to support the variant type natively?
>>
>> Best,
>> Gang
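[As context for the encoding discussion above, a minimal PySpark sketch of how the Spark variant type surfaces to users, assuming a Spark build with variant support (e.g., Spark 4.x); the input data and column names are illustrative. Per the Spark variant README referenced below as [2], each value is physically a pair of binary buffers (metadata plus value), which is why it maps naturally onto a binary-like storage type.]

    # A minimal sketch, assuming a Spark build with variant support
    # (e.g., Spark 4.x); table contents and column names are illustrative.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, parse_json, variant_get

    spark = SparkSession.builder.appName("variant-sketch").getOrCreate()

    df = spark.createDataFrame(
        [('{"user": {"id": 42, "tags": ["a", "b"]}}',)], ["raw_json"]
    )

    # parse_json yields a VARIANT column: a compact binary
    # (metadata + value) pair rather than a raw JSON string, so engines
    # need not re-parse text on every query.
    variants = df.select(parse_json(col("raw_json")).alias("v"))

    # variant_get extracts a typed nested field by path. Without
    # subcolumnarization, this still has to decode the binary value of
    # every non-null row in the scanned files.
    variants.select(
        variant_get(col("v"), "$.user.id", "int").alias("user_id")
    ).show()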
>> On Sat, May 11, 2024 at 1:07 PM Tyler Akidau <tyler.aki...@snowflake.com.invalid> wrote:
>>
>>> Hello,
>>>
>>> We (Tyler, Nileema, Selcuk, Aihua) are working on a proposal for which we'd like to get early feedback from the community. As you may know, Snowflake has embraced Iceberg as its open Data Lake format. Having made good progress on our own adoption of the Iceberg standard, we're now in a position where there are features not yet supported in Iceberg that we think would be valuable for our users, and that we would like to discuss with and help contribute to the Iceberg community.
>>>
>>> The first two such features we'd like to discuss are in support of efficient querying of dynamically typed, semi-structured data: variant data types, and subcolumnarization of variant columns. In more detail, for anyone who may not already be familiar:
>>>
>>> 1. Variant data types
>>> Variant types allow for the efficient binary encoding of dynamic semi-structured data such as JSON, Avro, etc. By encoding semi-structured data as a variant column, we retain the flexibility of the source data while allowing query engines to operate on it more efficiently. Snowflake has supported the variant data type on Snowflake tables for many years [1]. As more and more users utilize Iceberg tables in Snowflake, we're hearing an increasing chorus of requests for variant support. Additionally, other query engines such as Apache Spark have begun adding variant support [2]. As such, we believe it would be beneficial to the Iceberg community as a whole to standardize on the variant data type encoding used across Iceberg tables.
>>>
>>> One specific point to make here is that, since an Apache OSS version of variant encoding already exists in Spark, it likely makes sense to simply adopt the Spark encoding as the Iceberg standard as well. The encoding we use internally today in Snowflake is slightly different, but essentially equivalent, and we see no particular value in trying to clutter the space with another equivalent-but-incompatible encoding.
>>>
>>> 2. Subcolumnarization
>>> Subcolumnarization of variant columns allows query engines to efficiently prune datasets when subcolumns (i.e., nested fields) within a variant column are queried, and also allows optionally materializing some of the nested fields as columns of their own, affording queries on these subcolumns the ability to read less data and spend less CPU on extraction. When subcolumnarizing, the system managing table metadata and data tracks individual pruning statistics (min, max, null, etc.) for some subset of the nested fields within a variant, and also manages any optional materialization. Without subcolumnarization, any query which touches a variant column must read, parse, extract, and filter every row for which that column is non-null. Thus, by providing a standardized way of tracking subcolumn metadata and data for variant columns, Iceberg can make subcolumnar optimizations accessible across various catalogs and query engines.
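[To make the pruning mechanics above concrete, a purely illustrative sketch of the kind of per-file statistics a subcolumnarizing writer might track for a nested path, and how a planner could use them to skip files. None of these structures exist in the Iceberg spec today; the actual shape would be defined by the proposal.]

    # Hypothetical illustration of subcolumn pruning statistics; nothing
    # here is part of the Iceberg spec. The shapes are invented for
    # explanatory purposes only.
    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class SubcolumnStats:
        path: str              # nested path inside the variant, e.g. "$.user.id"
        lower: Optional[int]   # min value observed in the data file, if tracked
        upper: Optional[int]   # max value observed in the data file, if tracked
        null_count: int

    def can_skip_file(stats: SubcolumnStats, predicate_value: int) -> bool:
        """True if a filter like `path = predicate_value` provably matches
        no rows in the file, so the file can be pruned without reading,
        parsing, or extracting any variant bytes."""
        if stats.lower is None or stats.upper is None:
            return False  # no stats tracked for this path: must scan the file
        return predicate_value < stats.lower or predicate_value > stats.upper

    # A file whose $.user.id values span [100, 500] is skipped entirely
    # for the predicate $.user.id = 42.
    file_stats = SubcolumnStats("$.user.id", lower=100, upper=500, null_count=0)
    assert can_skip_file(file_stats, 42)

[Optional materialization, as described above, would go a step further: the extracted values for a nested field are stored as a column of their own, so queries touching only that field can avoid variant decoding entirely.]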
>>> Subcolumnarization is a non-trivial topic, so we expect any concrete proposal to include not only the set of changes to Iceberg metadata that allow compatible query engines to interoperate on subcolumnarization data for variant columns, but also reference documentation explaining subcolumnarization principles and recommended best practices.
>>>
>>> It sounds like the recent Geo proposal [3] may be a good starting point for how to approach this, so our plan is to write something up in that vein that covers the proposed spec changes, backwards compatibility, implementor burdens, etc. But we wanted to first reach out to the community to introduce ourselves and the idea, and see if there's any early feedback we should incorporate before we spend too much time on a concrete proposal.
>>>
>>> Thank you!
>>>
>>> [1] https://docs.snowflake.com/en/sql-reference/data-types-semistructured
>>> [2] https://github.com/apache/spark/blob/master/common/variant/README.md
>>> [3] https://docs.google.com/document/d/1iVFbrRNEzZl8tDcZC81GFt01QJkLJsI9E2NBOt21IRI/edit
>>>
>>> -Tyler, Nileema, Selcuk, Aihua