Hello,

We have drafted the proposal <https://docs.google.com/document/d/1QjhpG_SVNPZh3anFcpicMQx90ebwjL7rmzFYfUP89Iw/edit> for the Variant data type. Please help review and comment.
Thanks,
Aihua

On Thu, May 16, 2024 at 12:45 PM Jack Ye <yezhao...@gmail.com> wrote:

> +10000 for a JSON/BSON type. We also had the same discussion internally, and a JSON type would really play well with, for example, the SUPER type in Redshift: https://docs.aws.amazon.com/redshift/latest/dg/r_SUPER_type.html, and can also provide better integration with the Trino JSON type.
>
> Looking forward to the proposal!
>
> Best,
> Jack Ye
>
> On Wed, May 15, 2024 at 9:37 AM Tyler Akidau <tyler.aki...@snowflake.com.invalid> wrote:
>
>> On Tue, May 14, 2024 at 7:58 PM Gang Wu <ust...@gmail.com> wrote:
>>
>>> > We may need some guidance on just how many we need to look at; we were planning on Spark and Trino, but weren't sure how much further down the rabbit hole we needed to go.
>>>
>>> There are some engines living outside the Java world. It would be good if the proposal could cover the effort it takes to integrate the variant type into them (e.g. Velox, DataFusion, etc.). This is something that some proprietary Iceberg vendors also care about.
>>
>> Ack, makes sense. We can make sure to share some perspective on this.
>>
>>> > Not necessarily, no. As long as there's a binary type and Iceberg and the query engines are aware that the binary column needs to be interpreted as a variant, that should be sufficient.
>>>
>>> From the perspective of interoperability, it would be good to support a native type in the file specs. Life will be easier for projects like Apache XTable. File formats could also provide finer-grained statistics for the variant type, which facilitates data skipping.
>>
>> Agreed, there can definitely be additional value in native file format integration. Just wanted to highlight that it's not a strict requirement.
>>
>> -Tyler
>>
>>> Gang
>>>
>>> On Wed, May 15, 2024 at 6:49 AM Tyler Akidau <tyler.aki...@snowflake.com.invalid> wrote:
>>>
>>>> Good to see you again as well, JB! Thanks!
>>>>
>>>> -Tyler
>>>>
>>>> On Tue, May 14, 2024 at 1:04 PM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>>>>
>>>>> Hi Tyler,
>>>>>
>>>>> Super happy to see you there :) It reminds me of our discussions back at the start of Apache Beam :)
>>>>>
>>>>> Anyway, the thread is pretty interesting. I remember some discussions about a JSON data type for spec v3. The binary data type is already supported in spec v2.
>>>>>
>>>>> I'm looking forward to the proposal and happy to help on this!
>>>>>
>>>>> Regards
>>>>> JB
>>>>>
>>>>> On Sat, May 11, 2024 at 7:06 AM Tyler Akidau <tyler.aki...@snowflake.com.invalid> wrote:
>>>>> >
>>>>> > Hello,
>>>>> >
>>>>> > We (Tyler, Nileema, Selcuk, Aihua) are working on a proposal for which we’d like to get early feedback from the community. As you may know, Snowflake has embraced Iceberg as its open Data Lake format. Having made good progress on our own adoption of the Iceberg standard, we’re now in a position where there are features not yet supported in Iceberg which we think would be valuable for our users, and that we would like to discuss with and help contribute to the Iceberg community.
>>>>> >
>>>>> > The first two such features we’d like to discuss are in support of efficient querying of dynamically typed, semi-structured data: variant data types, and subcolumnarization of variant columns. In more detail, for anyone who may not already be familiar:
>>>>> >
>>>>> > 1. Variant data types
>>>>> > Variant types allow for the efficient binary encoding of dynamic semi-structured data such as JSON, Avro, etc. By encoding semi-structured data as a variant column, we retain the flexibility of the source data, while allowing query engines to more efficiently operate on the data. Snowflake has supported the variant data type on Snowflake tables for many years [1]. As more and more users utilize Iceberg tables in Snowflake, we’re hearing an increasing chorus of requests for variant support. Additionally, other query engines such as Apache Spark have begun adding variant support [2]. As such, we believe it would be beneficial to the Iceberg community as a whole to standardize on the variant data type encoding used across Iceberg tables.
>>>>> >
>>>>> > One specific point to make here is that, since an Apache OSS version of variant encoding already exists in Spark, it likely makes sense to simply adopt the Spark encoding as the Iceberg standard as well. The encoding we use internally today in Snowflake is slightly different, but essentially equivalent, and we see no particular value in trying to clutter the space with another equivalent-but-incompatible encoding.
>>>>> >
>>>>> > 2. Subcolumnarization
>>>>> > Subcolumnarization of variant columns allows query engines to efficiently prune datasets when subcolumns (i.e., nested fields) within a variant column are queried, and also allows optionally materializing some of the nested fields as columns of their own, affording queries on these subcolumns the ability to read less data and spend less CPU on extraction. When subcolumnarizing, the system managing table metadata and data tracks individual pruning statistics (min, max, null, etc.) for some subset of the nested fields within a variant, and also manages any optional materialization. Without subcolumnarization, any query which touches a variant column must read, parse, extract, and filter every row for which that column is non-null. Thus, by providing a standardized way of tracking subcolumn metadata and data for variant columns, Iceberg can make subcolumnar optimizations accessible across various catalogs and query engines.
>>>>> >
>>>>> > Subcolumnarization is a non-trivial topic, so we expect any concrete proposal to include not only the set of changes to Iceberg metadata that allow compatible query engines to interoperate on subcolumnarization data for variant columns, but also reference documentation explaining subcolumnarization principles and recommended best practices.
>>>>> >
>>>>> > It sounds like the recent Geo proposal [3] may be a good starting point for how to approach this, so our plan is to write something up in that vein that covers the proposed spec changes, backwards compatibility, implementor burdens, etc. But we wanted to first reach out to the community to introduce ourselves and the idea, and see if there’s any early feedback we should incorporate before we spend too much time on a concrete proposal.
>>>>> >
>>>>> > Thank you!
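>>>>> >
>>>>> > For a concrete feel of (1), here is a purely illustrative PySpark sketch, not taken from the proposal; it assumes a Spark build with variant support (e.g. Spark 4.0+) where the parse_json and variant_get functions are available:
>>>>> >
>>>>> >     # Illustrative only: assumes Spark with variant support, where parse_json()
>>>>> >     # produces a variant value and variant_get() extracts typed nested fields.
>>>>> >     from pyspark.sql import SparkSession
>>>>> >
>>>>> >     spark = SparkSession.builder.appName("variant-sketch").getOrCreate()
>>>>> >
>>>>> >     # Store free-form JSON as a binary-encoded variant rather than as a raw string.
>>>>> >     spark.sql("""
>>>>> >         SELECT parse_json('{"user": {"id": 42}, "latency_ms": 7}') AS payload
>>>>> >     """).createOrReplaceTempView("events")
>>>>> >
>>>>> >     # Engines can then extract typed nested fields directly from the encoding,
>>>>> >     # without re-parsing the original JSON text on every row.
>>>>> >     spark.sql("""
>>>>> >         SELECT variant_get(payload, '$.user.id', 'bigint') AS user_id,
>>>>> >                variant_get(payload, '$.latency_ms', 'int') AS latency_ms
>>>>> >         FROM events
>>>>> >     """).show()
>>>>> >
>>>>> > The exact function names vary by engine; the point is only that queries operate on the binary variant encoding instead of reparsing JSON text per row.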
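>>>>> >
>>>>> > To illustrate the pruning idea behind (2), here is a purely hypothetical Python sketch; none of these names come from the Iceberg spec or from the proposal, they only show how per-subcolumn min/max/null statistics could let an engine skip files:
>>>>> >
>>>>> >     # Hypothetical sketch: every name here is made up for illustration.
>>>>> >     from dataclasses import dataclass
>>>>> >     from typing import Dict, Optional
>>>>> >
>>>>> >     @dataclass
>>>>> >     class SubcolumnStats:
>>>>> >         path: str                  # nested field within the variant, e.g. "$.user.id"
>>>>> >         min_value: Optional[int]   # lower bound observed in the data file
>>>>> >         max_value: Optional[int]   # upper bound observed in the data file
>>>>> >         null_count: int            # rows where the field is missing or null
>>>>> >
>>>>> >     @dataclass
>>>>> >     class DataFile:
>>>>> >         path: str
>>>>> >         subcolumn_stats: Dict[str, SubcolumnStats]  # only for tracked subcolumns
>>>>> >
>>>>> >     def can_skip(f: DataFile, field: str, value: int) -> bool:
>>>>> >         """Skip the file for `field = value` when the tracked bounds rule it out.
>>>>> >         Without subcolumn stats, every non-null variant row must be read and parsed."""
>>>>> >         s = f.subcolumn_stats.get(field)
>>>>> >         if s is None or s.min_value is None or s.max_value is None:
>>>>> >             return False  # nothing tracked for this nested field: cannot prune
>>>>> >         return value < s.min_value or value > s.max_value
>>>>> >
>>>>> >     f = DataFile("part-00000.parquet",
>>>>> >                  {"$.user.id": SubcolumnStats("$.user.id", 1, 500, 0)})
>>>>> >     print(can_skip(f, "$.user.id", 1000))  # True: 1000 lies outside [1, 500]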
>>>>> >
>>>>> > [1] https://docs.snowflake.com/en/sql-reference/data-types-semistructured
>>>>> > [2] https://github.com/apache/spark/blob/master/common/variant/README.md
>>>>> > [3] https://docs.google.com/document/d/1iVFbrRNEzZl8tDcZC81GFt01QJkLJsI9E2NBOt21IRI/edit
>>>>> >
>>>>> > -Tyler, Nileema, Selcuk, Aihua