Re: [Early Feedback] Variant and Subcolumnarization Support

Aihua Xu Fri, 12 Jul 2024 10:10:23 -0700

Thanks Ryan.

Yeah. That's another reason we want to pursue Spark encoding to keep 
compatibility for the open source engines.


One more question regarding the encoding implementation: do we have an issue to 
directly use Spark implementation in Iceberg? Russell pointed out that Trino 
doesn't have Spark dependency and that could be a problem?

Thanks,
Aihua

On 2024/07/12 15:02:06 Ryan Blue wrote:
> Thanks, Aihua!
> 
> I think that the encoding choice in the current doc is a good one. I went
> through the Spark encoding in detail and it looks like a better choice than
> the other candidate encodings for quickly accessing nested fields.
> 
> Another reason to use the Spark type is that this is what Delta's variant
> type is based on, so Parquet files in tables written by Delta could be
> converted or used in Iceberg tables without needing to rewrite variant
> data. (Also, note that I work at Databricks and have an interest in
> increasing format compatibility.)
> 
> Ryan
> 
> On Thu, Jul 11, 2024 at 11:21 AM Aihua Xu <[email protected]>
> wrote:
> 
> > [Discuss] Consensus for Variant Encoding
> >
> > It’s great to be able to present the Variant type proposal in the
> > community sync yesterday and I’m looking to host a meeting next week
> > (targeting for 9am, July 17th) to go over any further concerns about the
> > encoding of the Variant type and any other questions on the first phase of
> > the proposal
> > <https://docs.google.com/document/d/1QjhpG_SVNPZh3anFcpicMQx90ebwjL7rmzFYfUP89Iw/edit>.
> > We are hoping that anyone who is interested in the proposal can either join
> > or reply with their comments so we can discuss them. Summary of the
> > discussion and notes will be sent to the mailing list for further comment
> > there.
> >
> >
> >    -
> >
> >    What should be the underlying binary representation
> >
> > We have evaluated a few encodings in the doc including ION, JSONB, and
> > Spark encoding.Choosing the underlying encoding is an important first step
> > here and we believe we have general support for Spark’s Variant encoding.
> > We would like to hear if anyone else has strong opinions in this space.
> >
> >
> >    -
> >
> >    Should we support multiple logical types or just Variant? Variant vs.
> >    Variant + JSON.
> >
> > This is to discuss what logical data type(s) to be supported in Iceberg -
> > Variant only vs. Variant + JSON. Both types would share the same underlying
> > encoding but would imply different limitations on engines working with
> > those types.
> >
> > From the sync up meeting, we are more favoring toward supporting Variant
> > only and we want to have a consensus on the supported type(s).
> >
> >
> >    -
> >
> >    How should we move forward with Subcolumnization?
> >
> > Subcolumnization is an optimization for Variant type by separating out
> > subcolumns with their own metadata. This is not critical for choosing the
> > initial encoding of the Variant type so we were hoping to gain consensus on
> > leaving that for a follow up spec.
> >
> >
> > Thanks
> >
> > Aihua
> >
> > Meeting invite:
> >
> > Wednesday, July 17 · 9:00 – 10:00am
> > Time zone: America/Los_Angeles
> > Google Meet joining info
> > Video call link: https://meet.google.com/pbm-ovzn-aoq
> > Or dial: ‪(US) +1 650-449-9343‬ PIN: ‪170 576 525‬#
> > More phone numbers: https://tel.meet/pbm-ovzn-aoq?pin=4079632691790
> >
> > On Tue, May 28, 2024 at 9:21 PM Aihua Xu <[email protected]> wrote:
> >
> >> Hello,
> >>
> >> We have drafted the proposal
> >> <https://docs.google.com/document/d/1QjhpG_SVNPZh3anFcpicMQx90ebwjL7rmzFYfUP89Iw/edit>
> >> for Variant data type. Please help review and comment.
> >>
> >> Thanks,
> >> Aihua
> >>
> >> On Thu, May 16, 2024 at 12:45 PM Jack Ye <[email protected]> wrote:
> >>
> >>> +10000 for a JSON/BSON type. We also had the same discussion internally
> >>> and a JSON type would really play well with for example the SUPER type in
> >>> Redshift:
> >>> https://docs.aws.amazon.com/redshift/latest/dg/r_SUPER_type.html, and
> >>> can also provide better integration with the Trino JSON type.
> >>>
> >>> Looking forward to the proposal!
> >>>
> >>> Best,
> >>> Jack Ye
> >>>
> >>>
> >>> On Wed, May 15, 2024 at 9:37 AM Tyler Akidau
> >>> <[email protected]> wrote:
> >>>
> >>>> On Tue, May 14, 2024 at 7:58 PM Gang Wu <[email protected]> wrote:
> >>>>
> >>>>> > We may need some guidance on just how many we need to look at;
> >>>>> > we were planning on Spark and Trino, but weren't sure how much
> >>>>> > further down the rabbit hole we needed to go。
> >>>>>
> >>>>> There are some engines living outside the Java world. It would be
> >>>>> good if the proposal could cover the effort it takes to integrate
> >>>>> variant type to them (e.g. velox, datafusion, etc.). This is something
> >>>>> that
> >>>>> some proprietary iceberg vendors also care about.
> >>>>>
> >>>>
> >>>> Ack, makes sense. We can make sure to share some perspective on this.
> >>>>
> >>>> > Not necessarily, no. As long as there's a binary type and Iceberg and
> >>>>> > the query engines are aware that the binary column needs to be
> >>>>> > interpreted as a variant, that should be sufficient.
> >>>>>
> >>>>> From the perspective of interoperability, it would be good to support
> >>>>> native
> >>>>> type from file specs. Life will be easier for projects like Apache
> >>>>> XTable.
> >>>>> File format could also provide finer-grained statistics for variant
> >>>>> type which
> >>>>> facilitates data skipping.
> >>>>>
> >>>>
> >>>> Agreed, there can definitely be additional value in native file format
> >>>> integration. Just wanted to highlight that it's not a strict requirement.
> >>>>
> >>>> -Tyler
> >>>>
> >>>>
> >>>>>
> >>>>> Gang
> >>>>>
> >>>>> On Wed, May 15, 2024 at 6:49 AM Tyler Akidau
> >>>>> <[email protected]> wrote:
> >>>>>
> >>>>>> Good to see you again as well, JB! Thanks!
> >>>>>>
> >>>>>> -Tyler
> >>>>>>
> >>>>>>
> >>>>>> On Tue, May 14, 2024 at 1:04 PM Jean-Baptiste Onofré 
> >>>>>> <[email protected]>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> Hi Tyler,
> >>>>>>>
> >>>>>>> Super happy to see you there :) It reminds me our discussions back in
> >>>>>>> the start of Apache Beam :)
> >>>>>>>
> >>>>>>> Anyway, the thread is pretty interesting. I remember some discussions
> >>>>>>> about JSON datatype for spec v3. The binary data type is already
> >>>>>>> supported in the spec v2.
> >>>>>>>
> >>>>>>> I'm looking forward to the proposal and happy to help on this !
> >>>>>>>
> >>>>>>> Regards
> >>>>>>> JB
> >>>>>>>
> >>>>>>> On Sat, May 11, 2024 at 7:06 AM Tyler Akidau
> >>>>>>> <[email protected]> wrote:
> >>>>>>> >
> >>>>>>> > Hello,
> >>>>>>> >
> >>>>>>> > We (Tyler, Nileema, Selcuk, Aihua) are working on a proposal for
> >>>>>>> which we’d like to get early feedback from the community. As you may 
> >>>>>>> know,
> >>>>>>> Snowflake has embraced Iceberg as its open Data Lake format. Having 
> >>>>>>> made
> >>>>>>> good progress on our own adoption of the Iceberg standard, we’re now 
> >>>>>>> in a
> >>>>>>> position where there are features not yet supported in Iceberg which 
> >>>>>>> we
> >>>>>>> think would be valuable for our users, and that we would like to 
> >>>>>>> discuss
> >>>>>>> with and help contribute to the Iceberg community.
> >>>>>>> >
> >>>>>>> > The first two such features we’d like to discuss are in support of
> >>>>>>> efficient querying of dynamically typed, semi-structured data: 
> >>>>>>> variant data
> >>>>>>> types, and subcolumnarization of variant columns. In more detail, for
> >>>>>>> anyone who may not already be familiar:
> >>>>>>> >
> >>>>>>> > 1. Variant data types
> >>>>>>> > Variant types allow for the efficient binary encoding of dynamic
> >>>>>>> semi-structured data such as JSON, Avro, etc. By encoding 
> >>>>>>> semi-structured
> >>>>>>> data as a variant column, we retain the flexibility of the source 
> >>>>>>> data,
> >>>>>>> while allowing query engines to more efficiently operate on the data.
> >>>>>>> Snowflake has supported the variant data type on Snowflake tables for 
> >>>>>>> many
> >>>>>>> years [1]. As more and more users utilize Iceberg tables in Snowflake,
> >>>>>>> we’re hearing an increasing chorus of requests for variant support.
> >>>>>>> Additionally, other query engines such as Apache Spark have begun 
> >>>>>>> adding
> >>>>>>> variant support [2]. As such, we believe it would be beneficial to the
> >>>>>>> Iceberg community as a whole to standardize on the variant data type
> >>>>>>> encoding used across Iceberg tables.
> >>>>>>> >
> >>>>>>> > One specific point to make here is that, since an Apache OSS
> >>>>>>> version of variant encoding already exists in Spark, it likely makes 
> >>>>>>> sense
> >>>>>>> to simply adopt the Spark encoding as the Iceberg standard as well. 
> >>>>>>> The
> >>>>>>> encoding we use internally today in Snowflake is slightly different, 
> >>>>>>> but
> >>>>>>> essentially equivalent, and we see no particular value in trying to 
> >>>>>>> clutter
> >>>>>>> the space with another equivalent-but-incompatible encoding.
> >>>>>>> >
> >>>>>>> >
> >>>>>>> > 2. Subcolumnarization
> >>>>>>> > Subcolumnarization of variant columns allows query engines to
> >>>>>>> efficiently prune datasets when subcolumns (i.e., nested fields) 
> >>>>>>> within a
> >>>>>>> variant column are queried, and also allows optionally materializing 
> >>>>>>> some
> >>>>>>> of the nested fields as a column on their own, affording queries on 
> >>>>>>> these
> >>>>>>> subcolumns the ability to read less data and spend less CPU on 
> >>>>>>> extraction.
> >>>>>>> When subcolumnarizing, the system managing table metadata and data 
> >>>>>>> tracks
> >>>>>>> individual pruning statistics (min, max, null, etc.) for some subset 
> >>>>>>> of the
> >>>>>>> nested fields within a variant, and also manages any optional
> >>>>>>> materialization. Without subcolumnarization, any query which touches a
> >>>>>>> variant column must read, parse, extract, and filter every row for 
> >>>>>>> which
> >>>>>>> that column is non-null. Thus, by providing a standardized way of 
> >>>>>>> tracking
> >>>>>>> subcolum metadata and data for variant columns, Iceberg can make
> >>>>>>> subcolumnar optimizations accessible across various catalogs and query
> >>>>>>> engines.
> >>>>>>> >
> >>>>>>> > Subcolumnarization is a non-trivial topic, so we expect any
> >>>>>>> concrete proposal to include not only the set of changes to Iceberg
> >>>>>>> metadata that allow compatible query engines to interopate on
> >>>>>>> subcolumnarization data for variant columns, but also reference
> >>>>>>> documentation explaining subcolumnarization principles and 
> >>>>>>> recommended best
> >>>>>>> practices.
> >>>>>>> >
> >>>>>>> >
> >>>>>>> > It sounds like the recent Geo proposal [3] may be a good starting
> >>>>>>> point for how to approach this, so our plan is to write something up 
> >>>>>>> in
> >>>>>>> that vein that covers the proposed spec changes, backwards 
> >>>>>>> compatibility,
> >>>>>>> implementor burdens, etc. But we wanted to first reach out to the 
> >>>>>>> community
> >>>>>>> to introduce ourselves and the idea, and see if there’s any early 
> >>>>>>> feedback
> >>>>>>> we should incorporate before we spend too much time on a concrete 
> >>>>>>> proposal.
> >>>>>>> >
> >>>>>>> > Thank you!
> >>>>>>> >
> >>>>>>> > [1]
> >>>>>>> https://docs.snowflake.com/en/sql-reference/data-types-semistructured
> >>>>>>> > [2]
> >>>>>>> https://github.com/apache/spark/blob/master/common/variant/README.md
> >>>>>>> > [3]
> >>>>>>> https://docs.google.com/document/d/1iVFbrRNEzZl8tDcZC81GFt01QJkLJsI9E2NBOt21IRI/edit
> >>>>>>> >
> >>>>>>> > -Tyler, Nileema, Selcuk, Aihua
> >>>>>>> >
> >>>>>>>
> >>>>>>
> 
> -- 
> Ryan Blue
> Databricks
>

Re: [Early Feedback] Variant and Subcolumnarization Support

Reply via email to