Re: [Early Feedback] Variant and Subcolumnarization Support

Tyler Akidau Wed, 15 May 2024 09:37:00 -0700

On Tue, May 14, 2024 at 7:58 PM Gang Wu <[email protected]> wrote:

> > We may need some guidance on just how many we need to look at;
> > we were planning on Spark and Trino, but weren't sure how much
> > further down the rabbit hole we needed to go.
>
> There are some engines living outside the Java world. It would be
> good if the proposal could cover the effort it takes to integrate
> the variant type into them (e.g. Velox, DataFusion, etc.). This is
> something that some proprietary Iceberg vendors also care about.
>


Ack, makes sense. We can make sure to share some perspective on this.

> > Not necessarily, no. As long as there's a binary type and Iceberg and
> > the query engines are aware that the binary column needs to be
> > interpreted as a variant, that should be sufficient.
>
> From the perspective of interoperability, it would be good to support
> a native type in the file specs. Life will be easier for projects like
> Apache XTable. File formats could also provide finer-grained statistics
> for the variant type, which facilitates data skipping.
>

Agreed, there can definitely be additional value in native file format
integration. Just wanted to highlight that it's not a strict requirement.

-Tyler


>
> Gang
>
> On Wed, May 15, 2024 at 6:49 AM Tyler Akidau
> <[email protected]> wrote:
>
>> Good to see you again as well, JB! Thanks!
>>
>> -Tyler
>>
>>
>> On Tue, May 14, 2024 at 1:04 PM Jean-Baptiste Onofré <[email protected]>
>> wrote:
>>
>>> Hi Tyler,
>>>
>>> Super happy to see you there :) It reminds me of our discussions back at
>>> the start of Apache Beam :)
>>>
>>> Anyway, the thread is pretty interesting. I remember some discussions
>>> about a JSON datatype for spec v3. The binary data type is already
>>> supported in spec v2.
>>>
>>> I'm looking forward to the proposal and happy to help with this!
>>>
>>> Regards
>>> JB
>>>
>>> On Sat, May 11, 2024 at 7:06 AM Tyler Akidau
>>> <[email protected]> wrote:
>>> >
>>> > Hello,
>>> >
>>> > We (Tyler, Nileema, Selcuk, Aihua) are working on a proposal for which
>>> we’d like to get early feedback from the community. As you may know,
>>> Snowflake has embraced Iceberg as its open Data Lake format. Having made
>>> good progress on our own adoption of the Iceberg standard, we’re now in a
>>> position where there are features not yet supported in Iceberg which we
>>> think would be valuable for our users, and that we would like to discuss
>>> with and help contribute to the Iceberg community.
>>> >
>>> > The first two such features we’d like to discuss are in support of
>>> efficient querying of dynamically typed, semi-structured data: variant data
>>> types, and subcolumnarization of variant columns. In more detail, for
>>> anyone who may not already be familiar:
>>> >
>>> > 1. Variant data types
>>> > Variant types allow for the efficient binary encoding of dynamic
>>> semi-structured data such as JSON, Avro, etc. By encoding semi-structured
>>> data as a variant column, we retain the flexibility of the source data,
>>> while allowing query engines to more efficiently operate on the data.
>>> Snowflake has supported the variant data type on Snowflake tables for many
>>> years [1]. As more and more users utilize Iceberg tables in Snowflake,
>>> we’re hearing an increasing chorus of requests for variant support.
>>> Additionally, other query engines such as Apache Spark have begun adding
>>> variant support [2]. As such, we believe it would be beneficial to the
>>> Iceberg community as a whole to standardize on the variant data type
>>> encoding used across Iceberg tables.
>>> >
>>> > One specific point to make here is that, since an Apache OSS version
>>> of variant encoding already exists in Spark, it likely makes sense to
>>> simply adopt the Spark encoding as the Iceberg standard as well. The
>>> encoding we use internally today in Snowflake is slightly different, but
>>> essentially equivalent, and we see no particular value in trying to clutter
>>> the space with another equivalent-but-incompatible encoding.
>>> >
>>> >
>>> > 2. Subcolumnarization
>>> > Subcolumnarization of variant columns allows query engines to
>>> efficiently prune datasets when subcolumns (i.e., nested fields) within a
>>> variant column are queried, and also allows optionally materializing some
>>> of the nested fields as a column on their own, affording queries on these
>>> subcolumns the ability to read less data and spend less CPU on extraction.
>>> When subcolumnarizing, the system managing table metadata and data tracks
>>> individual pruning statistics (min, max, null, etc.) for some subset of the
>>> nested fields within a variant, and also manages any optional
>>> materialization. Without subcolumnarization, any query which touches a
>>> variant column must read, parse, extract, and filter every row for which
>>> that column is non-null. Thus, by providing a standardized way of tracking
>>> subcolumn metadata and data for variant columns, Iceberg can make
>>> subcolumnar optimizations accessible across various catalogs and query
>>> engines.
>>> >
>>> > Subcolumnarization is a non-trivial topic, so we expect any concrete
>>> proposal to include not only the set of changes to Iceberg metadata that
>>> allow compatible query engines to interoperate on subcolumnarization data for
>>> variant columns, but also reference documentation explaining
>>> subcolumnarization principles and recommended best practices.
>>> >
>>> >
>>> > It sounds like the recent Geo proposal [3] may be a good starting
>>> point for how to approach this, so our plan is to write something up in
>>> that vein that covers the proposed spec changes, backwards compatibility,
>>> implementor burdens, etc. But we wanted to first reach out to the community
>>> to introduce ourselves and the idea, and see if there’s any early feedback
>>> we should incorporate before we spend too much time on a concrete proposal.
>>> >
>>> > Thank you!
>>> >
>>> > [1]
>>> https://docs.snowflake.com/en/sql-reference/data-types-semistructured
>>> > [2]
>>> https://github.com/apache/spark/blob/master/common/variant/README.md
>>> > [3]
>>> https://docs.google.com/document/d/1iVFbrRNEzZl8tDcZC81GFt01QJkLJsI9E2NBOt21IRI/edit
>>> >
>>> > -Tyler, Nileema, Selcuk, Aihua
>>> >
>>>
>>
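
For anyone skimming the thread, the pruning idea described under "2. Subcolumnarization" in the original message can be sketched roughly as below. This is only an illustration under stated assumptions: FileStats, subcolumn_bounds, and files_to_scan are hypothetical names made up for this sketch, not part of the Iceberg spec or of any proposal in this thread, and only min/max bounds are modeled (no null counts, no materialization).

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class FileStats:
    """Hypothetical per-data-file metadata; not an actual Iceberg structure."""
    path: str
    # Maps a variant subcolumn path (e.g. "user.age") to its (min, max) in this file.
    # A path missing from the map means no statistics were tracked for it.
    subcolumn_bounds: Dict[str, Tuple[Any, Any]] = field(default_factory=dict)


def files_to_scan(files: List[FileStats], subcolumn: str, lo: Any, hi: Any) -> List[str]:
    """Return paths of files that may hold rows with lo <= subcolumn <= hi."""
    keep = []
    for f in files:
        bounds = f.subcolumn_bounds.get(subcolumn)
        if bounds is None:
            # No statistics for this subcolumn: the file cannot be ruled out,
            # so every row's variant value would have to be read and parsed.
            keep.append(f.path)
            continue
        fmin, fmax = bounds
        if fmax < lo or fmin > hi:
            # The file's value range cannot overlap the predicate: skip it.
            continue
        keep.append(f.path)
    return keep


files = [
    FileStats("a.parquet", {"user.age": (18, 30)}),
    FileStats("b.parquet", {"user.age": (41, 65)}),
    FileStats("c.parquet"),  # no subcolumn statistics tracked
]
print(files_to_scan(files, "user.age", 35, 50))  # ['b.parquet', 'c.parquet']
```

The file without tracked statistics is always kept, which mirrors the point in the original message that, without subcolumnarization, a query touching a variant column has to read and filter every row.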
