Re: [Early Feedback] Variant and Subcolumnarization Support

Fokko Driesprong Tue, 25 Jun 2024 06:49:09 -0700

Hey Aihua and Tyler,

Thanks again for raising this. I reviewed the proposal and it looks good.
Thanks also to everyone for jumping in and providing feedback. Looking at
the proposal and comments, I think the biggest open issue that needs to be
decided is subcolumnarization vs. native type
<https://docs.google.com/document/d/1QjhpG_SVNPZh3anFcpicMQx90ebwjL7rmzFYfUP89Iw/edit?disco=AAABOfRqAn8>.
The Spark variant encoding also has an open PR on the subject of
subcolumnarization <https://github.com/apache/spark/pull/46831/>.


To keep this moving, I would suggest that the document's authors go over
the open issues and try to resolve the low-hanging fruit. That alone will
clean up the proposal quite a bit. Then we can come up with a list of open
questions (happy to help) and have a meeting to discuss these. WDYT?

Kind regards,
Fokko Driesprong

On Fri, May 31, 2024 at 6:54 PM Aihua Xu <[email protected]> wrote:

> Hello,
>
> We have drafted the proposal
> <https://docs.google.com/document/d/1QjhpG_SVNPZh3anFcpicMQx90ebwjL7rmzFYfUP89Iw/edit>
> for Variant data type. Please help review and comment.
>
> Thanks,
> Aihua
>
> On Thu, May 16, 2024 at 12:45 PM Jack Ye <[email protected]> wrote:
>
>> +10000 for a JSON/BSON type. We also had the same discussion internally,
>> and a JSON type would really play well with, for example, the SUPER type
>> in Redshift:
>> https://docs.aws.amazon.com/redshift/latest/dg/r_SUPER_type.html, and
>> could also provide better integration with the Trino JSON type.
>>
>> Looking forward to the proposal!
>>
>> Best,
>> Jack Ye
>>
>>
>> On Wed, May 15, 2024 at 9:37 AM Tyler Akidau
>> <[email protected]> wrote:
>>
>>> On Tue, May 14, 2024 at 7:58 PM Gang Wu <[email protected]> wrote:
>>>
>>>> > We may need some guidance on just how many we need to look at;
>>>> > we were planning on Spark and Trino, but weren't sure how much
>>>> > further down the rabbit hole we needed to go.
>>>>
>>>> There are some engines living outside the Java world. It would be
>>>> good if the proposal could cover the effort it takes to integrate the
>>>> variant type into them (e.g. Velox, DataFusion, etc.). This is something
>>>> that some proprietary Iceberg vendors also care about.
>>>>
>>>
>>> Ack, makes sense. We can make sure to share some perspective on this.
>>>
>>> > Not necessarily, no. As long as there's a binary type and Iceberg and
>>>> > the query engines are aware that the binary column needs to be
>>>> > interpreted as a variant, that should be sufficient.
>>>>
>>>> From the perspective of interoperability, it would be good to support a
>>>> native type in the file format specs. Life will be easier for projects
>>>> like Apache XTable. The file format could also provide finer-grained
>>>> statistics for the variant type, which facilitates data skipping.
>>>>
>>>
>>> Agreed, there can definitely be additional value in native file format
>>> integration. Just wanted to highlight that it's not a strict requirement.
>>>
>>> -Tyler
>>>
>>>
>>>>
>>>> Gang
>>>>
>>>> On Wed, May 15, 2024 at 6:49 AM Tyler Akidau
>>>> <[email protected]> wrote:
>>>>
>>>>> Good to see you again as well, JB! Thanks!
>>>>>
>>>>> -Tyler
>>>>>
>>>>>
>>>>> On Tue, May 14, 2024 at 1:04 PM Jean-Baptiste Onofré <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hi Tyler,
>>>>>>
>>>>>> Super happy to see you there :) It reminds me of our discussions back
>>>>>> at the start of Apache Beam :)
>>>>>>
>>>>>> Anyway, the thread is pretty interesting. I remember some discussions
>>>>>> about a JSON data type for spec v3. The binary data type is already
>>>>>> supported in spec v2.
>>>>>>
>>>>>> I'm looking forward to the proposal and happy to help on this!
>>>>>>
>>>>>> Regards
>>>>>> JB
>>>>>>
>>>>>> On Sat, May 11, 2024 at 7:06 AM Tyler Akidau
>>>>>> <[email protected]> wrote:
>>>>>> >
>>>>>> > Hello,
>>>>>> >
>>>>>> > We (Tyler, Nileema, Selcuk, Aihua) are working on a proposal for
>>>>>> which we’d like to get early feedback from the community. As you may
>>>>>> know, Snowflake has embraced Iceberg as its open Data Lake format. Having
>>>>>> made good progress on our own adoption of the Iceberg standard, we’re now
>>>>>> in a position where there are features not yet supported in Iceberg which
>>>>>> we think would be valuable for our users, and that we would like to
>>>>>> discuss with and help contribute to the Iceberg community.
>>>>>> >
>>>>>> > The first two such features we’d like to discuss are in support of
>>>>>> efficient querying of dynamically typed, semi-structured data: variant
>>>>>> data types, and subcolumnarization of variant columns. In more detail,
>>>>>> for anyone who may not already be familiar:
>>>>>> >
>>>>>> > 1. Variant data types
>>>>>> > Variant types allow for the efficient binary encoding of dynamic
>>>>>> semi-structured data such as JSON, Avro, etc. By encoding semi-structured
>>>>>> data as a variant column, we retain the flexibility of the source data,
>>>>>> while allowing query engines to more efficiently operate on the data.
>>>>>> Snowflake has supported the variant data type on Snowflake tables for
>>>>>> many years [1]. As more and more users utilize Iceberg tables in
>>>>>> Snowflake, we’re hearing an increasing chorus of requests for variant
>>>>>> support. Additionally, other query engines such as Apache Spark have
>>>>>> begun adding variant support [2]. As such, we believe it would be
>>>>>> beneficial to the Iceberg community as a whole to standardize on the
>>>>>> variant data type encoding used across Iceberg tables.
>>>>>> >
>>>>>> > One specific point to make here is that, since an Apache OSS
>>>>>> version of variant encoding already exists in Spark, it likely makes
>>>>>> sense to simply adopt the Spark encoding as the Iceberg standard as well.
>>>>>> The encoding we use internally today in Snowflake is slightly different,
>>>>>> but essentially equivalent, and we see no particular value in trying to
>>>>>> clutter the space with another equivalent-but-incompatible encoding.
>>>>>> >
>>>>>> >
>>>>>> > 2. Subcolumnarization
>>>>>> > Subcolumnarization of variant columns allows query engines to
>>>>>> efficiently prune datasets when subcolumns (i.e., nested fields) within a
>>>>>> variant column are queried, and also allows optionally materializing some
>>>>>> of the nested fields as columns of their own, affording queries on these
>>>>>> subcolumns the ability to read less data and spend less CPU on
>>>>>> extraction. When subcolumnarizing, the system managing table metadata and
>>>>>> data tracks individual pruning statistics (min, max, null, etc.) for some
>>>>>> subset of the nested fields within a variant, and also manages any
>>>>>> optional materialization. Without subcolumnarization, any query which
>>>>>> touches a variant column must read, parse, extract, and filter every row
>>>>>> for which that column is non-null. Thus, by providing a standardized way
>>>>>> of tracking subcolumn metadata and data for variant columns, Iceberg can
>>>>>> make subcolumnar optimizations accessible across various catalogs and
>>>>>> query engines.
>>>>>> >
>>>>>> > Subcolumnarization is a non-trivial topic, so we expect any
>>>>>> concrete proposal to include not only the set of changes to Iceberg
>>>>>> metadata that allow compatible query engines to interoperate on
>>>>>> subcolumnarization data for variant columns, but also reference
>>>>>> documentation explaining subcolumnarization principles and recommended
>>>>>> best practices.
>>>>>> >
>>>>>> >
>>>>>> > It sounds like the recent Geo proposal [3] may be a good starting
>>>>>> point for how to approach this, so our plan is to write something up in
>>>>>> that vein that covers the proposed spec changes, backwards compatibility,
>>>>>> implementor burdens, etc. But we wanted to first reach out to the
>>>>>> community to introduce ourselves and the idea, and see if there’s any
>>>>>> early feedback we should incorporate before we spend too much time on a
>>>>>> concrete proposal.
>>>>>> >
>>>>>> > Thank you!
>>>>>> >
>>>>>> > [1]
>>>>>> https://docs.snowflake.com/en/sql-reference/data-types-semistructured
>>>>>> > [2]
>>>>>> https://github.com/apache/spark/blob/master/common/variant/README.md
>>>>>> > [3]
>>>>>> https://docs.google.com/document/d/1iVFbrRNEzZl8tDcZC81GFt01QJkLJsI9E2NBOt21IRI/edit
>>>>>> >
>>>>>> > -Tyler, Nileema, Selcuk, Aihua
>>>>>> >
>>>>>>
>>>>>
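
To illustrate the subcolumn pruning described in section 2 of the original
message, here is a minimal Python sketch. It is not the Iceberg, Spark, or
Snowflake implementation: variant values are modeled as plain Python dicts,
and the names SubcolumnStats, extract, collect_stats, and may_contain are
hypothetical. It also assumes every value found at a given path is of a
single comparable type, which real variant data does not guarantee.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class SubcolumnStats:
    """Hypothetical per-file pruning statistics for one nested field path."""
    min_value: Any
    max_value: Any
    null_count: int


def extract(value: Any, path: str) -> Any:
    """Walk a dotted path such as 'user.age'; return None if a step is missing."""
    current = value
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def collect_stats(rows: List[dict], path: str) -> SubcolumnStats:
    """Compute min, max, and null count for one path over a file's rows."""
    values = [extract(row, path) for row in rows]
    present = [v for v in values if v is not None]
    null_count = len(values) - len(present)
    if not present:
        return SubcolumnStats(None, None, null_count)
    # Assumption: all present values are mutually comparable (e.g. all ints).
    return SubcolumnStats(min(present), max(present), null_count)


def may_contain(stats: SubcolumnStats, target: Any) -> bool:
    """Return False only when the stats prove no row can equal target."""
    if stats.min_value is None:
        return False  # the path is missing or null in every row of the file
    return stats.min_value <= target <= stats.max_value


# Two toy "data files", each holding rows of a single variant column.
files: Dict[str, List[dict]] = {
    "file-a": [{"user": {"age": 21}}, {"user": {"age": 35}}],
    "file-b": [{"user": {"age": 50}}, {"user": {"age": 64}}, {"event": "login"}],
}

# Statistics are gathered once at write time and kept with table metadata.
stats = {name: collect_stats(rows, "user.age") for name, rows in files.items()}

# Query: user.age = 30. Only files whose [min, max] range covers 30 are read.
files_to_read = [name for name, s in stats.items() if may_contain(s, 30)]
print(files_to_read)
```

In this toy setup only file-a is scanned for the predicate user.age = 30,
because the range recorded for file-b excludes 30. Without the per-path
statistics, both files would have to be read and every row parsed.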
