Hi Aihua,

Long time no see :) Would this mean that every engine which plans to support the Variant data type needs to add Spark as a dependency? Like Flink/Trino/Hive, etc.?

Thanks,
Peter
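The dependency question comes down to whether the encoding is specified at the byte level: if it is, any engine can implement its own reader directly against the spec, with no Spark dependency. As a purely illustrative sketch, here is a toy dictionary-based, variant-style layout (hypothetical; the actual Spark encoding is defined in its variant README) showing that a reader is just byte-level decoding:

```python
import struct

def encode(obj):
    """Encode a flat dict of string->int into a (metadata, value) byte pair.
    Hypothetical layout: metadata holds a sorted key dictionary; the value
    stores (key_id, int64) pairs that reference keys by index."""
    keys = sorted(obj)
    metadata = struct.pack("<B", len(keys))
    for k in keys:
        kb = k.encode("utf-8")
        metadata += struct.pack("<B", len(kb)) + kb
    value = struct.pack("<B", len(obj))
    for i, k in enumerate(keys):
        value += struct.pack("<Bq", i, obj[k])
    return metadata, value

def read_keys(metadata):
    """Decode the key dictionary from the metadata buffer."""
    n, keys, pos = metadata[0], [], 1
    for _ in range(n):
        ln = metadata[pos]
        keys.append(metadata[pos + 1:pos + 1 + ln].decode("utf-8"))
        pos += 1 + ln
    return keys

def get_field(metadata, value, name):
    """Extract a single field without materializing the whole object."""
    keys = read_keys(metadata)
    for i in range(value[0]):
        key_id, v = struct.unpack_from("<Bq", value, 1 + i * 9)
        if keys[key_id] == name:
            return v
    return None

md, val = encode({"city_id": 42, "zip": 94105})
print(get_field(md, val, "zip"))  # 94105
```

The point is only that the reader needs the byte layout, not any Spark classes; a real implementation would follow the Spark variant spec's type tags and offset encoding instead of this toy layout.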
On Fri, Jul 12, 2024, 19:10 Aihua Xu <aihu...@apache.org> wrote:

> Thanks Ryan.
>
> Yeah. That's another reason we want to pursue the Spark encoding: to keep
> compatibility for the open source engines.
>
> One more question regarding the encoding implementation: is there any
> issue with directly using the Spark implementation in Iceberg? Russell
> pointed out that Trino doesn't have a Spark dependency, and that could be
> a problem?
>
> Thanks,
> Aihua
>
> On 2024/07/12 15:02:06 Ryan Blue wrote:
> > Thanks, Aihua!
> >
> > I think that the encoding choice in the current doc is a good one. I went
> > through the Spark encoding in detail and it looks like a better choice than
> > the other candidate encodings for quickly accessing nested fields.
> >
> > Another reason to use the Spark type is that this is what Delta's variant
> > type is based on, so Parquet files in tables written by Delta could be
> > converted or used in Iceberg tables without needing to rewrite variant
> > data. (Also, note that I work at Databricks and have an interest in
> > increasing format compatibility.)
> >
> > Ryan
> >
> > On Thu, Jul 11, 2024 at 11:21 AM Aihua Xu <aihua...@snowflake.com.invalid> wrote:
> >
> > > [Discuss] Consensus for Variant Encoding
> > >
> > > It was great to be able to present the Variant type proposal in the
> > > community sync yesterday, and I'm looking to host a meeting next week
> > > (targeting 9am, July 17th) to go over any further concerns about the
> > > encoding of the Variant type and any other questions on the first phase of
> > > the proposal
> > > <https://docs.google.com/document/d/1QjhpG_SVNPZh3anFcpicMQx90ebwjL7rmzFYfUP89Iw/edit>.
> > > We are hoping that anyone who is interested in the proposal can either join
> > > or reply with their comments so we can discuss them. A summary of the
> > > discussion and notes will be sent to the mailing list for further comment
> > > there.
> > >
> > >    - What should be the underlying binary representation?
> > >
> > > We have evaluated a few encodings in the doc, including ION, JSONB, and
> > > the Spark encoding. Choosing the underlying encoding is an important first
> > > step here, and we believe we have general support for Spark's Variant
> > > encoding. We would like to hear if anyone else has strong opinions in this
> > > space.
> > >
> > >    - Should we support multiple logical types or just Variant? Variant
> > >      vs. Variant + JSON.
> > >
> > > This is to discuss which logical data type(s) should be supported in
> > > Iceberg: Variant only vs. Variant + JSON. Both types would share the same
> > > underlying encoding but would imply different limitations on engines
> > > working with those types.
> > >
> > > From the sync-up meeting, we are leaning toward supporting Variant only,
> > > and we want to reach a consensus on the supported type(s).
> > >
> > >    - How should we move forward with subcolumnarization?
> > >
> > > Subcolumnarization is an optimization for the Variant type that separates
> > > out subcolumns with their own metadata. This is not critical for choosing
> > > the initial encoding of the Variant type, so we were hoping to gain
> > > consensus on leaving that for a follow-up spec.
> > >
> > > Thanks,
> > > Aihua
> > >
> > > Meeting invite:
> > >
> > > Wednesday, July 17 · 9:00 – 10:00am
> > > Time zone: America/Los_Angeles
> > > Google Meet joining info
> > > Video call link: https://meet.google.com/pbm-ovzn-aoq
> > > Or dial: (US) +1 650-449-9343 PIN: 170 576 525#
> > > More phone numbers: https://tel.meet/pbm-ovzn-aoq?pin=4079632691790
> > >
> > > On Tue, May 28, 2024 at 9:21 PM Aihua Xu <aihua...@snowflake.com> wrote:
> > >
> > >> Hello,
> > >>
> > >> We have drafted the proposal
> > >> <https://docs.google.com/document/d/1QjhpG_SVNPZh3anFcpicMQx90ebwjL7rmzFYfUP89Iw/edit>
> > >> for the Variant data type.
> > >> Please help review and comment.
> > >>
> > >> Thanks,
> > >> Aihua
> > >>
> > >> On Thu, May 16, 2024 at 12:45 PM Jack Ye <yezhao...@gmail.com> wrote:
> > >>
> > >>> +10000 for a JSON/BSON type. We also had the same discussion internally,
> > >>> and a JSON type would really play well with, for example, the SUPER type
> > >>> in Redshift:
> > >>> https://docs.aws.amazon.com/redshift/latest/dg/r_SUPER_type.html, and
> > >>> can also provide better integration with the Trino JSON type.
> > >>>
> > >>> Looking forward to the proposal!
> > >>>
> > >>> Best,
> > >>> Jack Ye
> > >>>
> > >>> On Wed, May 15, 2024 at 9:37 AM Tyler Akidau
> > >>> <tyler.aki...@snowflake.com.invalid> wrote:
> > >>>
> > >>>> On Tue, May 14, 2024 at 7:58 PM Gang Wu <ust...@gmail.com> wrote:
> > >>>>
> > >>>>> > We may need some guidance on just how many we need to look at;
> > >>>>> > we were planning on Spark and Trino, but weren't sure how much
> > >>>>> > further down the rabbit hole we needed to go.
> > >>>>>
> > >>>>> There are some engines living outside the Java world. It would be
> > >>>>> good if the proposal could cover the effort it takes to integrate the
> > >>>>> variant type into them (e.g. Velox, DataFusion, etc.). This is
> > >>>>> something that some proprietary Iceberg vendors also care about.
> > >>>>>
> > >>>>
> > >>>> Ack, makes sense. We can make sure to share some perspective on this.
> > >>>>
> > >>>>> > Not necessarily, no. As long as there's a binary type and Iceberg and
> > >>>>> > the query engines are aware that the binary column needs to be
> > >>>>> > interpreted as a variant, that should be sufficient.
> > >>>>>
> > >>>>> From the perspective of interoperability, it would be good to support
> > >>>>> a native type in the file specs. Life would be easier for projects
> > >>>>> like Apache XTable. File formats could also provide finer-grained
> > >>>>> statistics for the variant type, which facilitates data skipping.
> > >>>>>
> > >>>>
> > >>>> Agreed, there can definitely be additional value in native file format
> > >>>> integration. Just wanted to highlight that it's not a strict requirement.
> > >>>>
> > >>>> -Tyler
> > >>>>
> > >>>>>
> > >>>>> Gang
> > >>>>>
> > >>>>> On Wed, May 15, 2024 at 6:49 AM Tyler Akidau
> > >>>>> <tyler.aki...@snowflake.com.invalid> wrote:
> > >>>>>
> > >>>>>> Good to see you again as well, JB! Thanks!
> > >>>>>>
> > >>>>>> -Tyler
> > >>>>>>
> > >>>>>> On Tue, May 14, 2024 at 1:04 PM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
> > >>>>>>
> > >>>>>>> Hi Tyler,
> > >>>>>>>
> > >>>>>>> Super happy to see you there :) It reminds me of our discussions back
> > >>>>>>> at the start of Apache Beam :)
> > >>>>>>>
> > >>>>>>> Anyway, the thread is pretty interesting. I remember some discussions
> > >>>>>>> about a JSON datatype for spec v3. The binary data type is already
> > >>>>>>> supported in spec v2.
> > >>>>>>>
> > >>>>>>> I'm looking forward to the proposal and happy to help on this!
> > >>>>>>>
> > >>>>>>> Regards,
> > >>>>>>> JB
> > >>>>>>>
> > >>>>>>> On Sat, May 11, 2024 at 7:06 AM Tyler Akidau
> > >>>>>>> <tyler.aki...@snowflake.com.invalid> wrote:
> > >>>>>>> >
> > >>>>>>> > Hello,
> > >>>>>>> >
> > >>>>>>> > We (Tyler, Nileema, Selcuk, Aihua) are working on a proposal for
> > >>>>>>> which we'd like to get early feedback from the community. As you may
> > >>>>>>> know, Snowflake has embraced Iceberg as its open data lake format.
> > >>>>>>> Having made good progress on our own adoption of the Iceberg standard,
> > >>>>>>> we're now in a position where there are features not yet supported in
> > >>>>>>> Iceberg which we think would be valuable for our users, and that we
> > >>>>>>> would like to discuss with and help contribute to the Iceberg community.
> > >>>>>>> >
> > >>>>>>> > The first two such features we'd like to discuss are in support of
> > >>>>>>> efficient querying of dynamically typed, semi-structured data: variant
> > >>>>>>> data types, and subcolumnarization of variant columns. In more detail,
> > >>>>>>> for anyone who may not already be familiar:
> > >>>>>>> >
> > >>>>>>> > 1. Variant data types
> > >>>>>>> > Variant types allow for the efficient binary encoding of dynamic,
> > >>>>>>> semi-structured data such as JSON, Avro, etc. By encoding
> > >>>>>>> semi-structured data as a variant column, we retain the flexibility of
> > >>>>>>> the source data while allowing query engines to operate on the data
> > >>>>>>> more efficiently. Snowflake has supported the variant data type on
> > >>>>>>> Snowflake tables for many years [1]. As more and more users utilize
> > >>>>>>> Iceberg tables in Snowflake, we're hearing an increasing chorus of
> > >>>>>>> requests for variant support. Additionally, other query engines such
> > >>>>>>> as Apache Spark have begun adding variant support [2]. As such, we
> > >>>>>>> believe it would be beneficial to the Iceberg community as a whole to
> > >>>>>>> standardize on the variant data type encoding used across Iceberg
> > >>>>>>> tables.
> > >>>>>>> >
> > >>>>>>> > One specific point to make here is that, since an Apache OSS
> > >>>>>>> version of variant encoding already exists in Spark, it likely makes
> > >>>>>>> sense to simply adopt the Spark encoding as the Iceberg standard as
> > >>>>>>> well. The encoding we use internally today in Snowflake is slightly
> > >>>>>>> different, but essentially equivalent, and we see no particular value
> > >>>>>>> in trying to clutter the space with another
> > >>>>>>> equivalent-but-incompatible encoding.
> > >>>>>>> >
> > >>>>>>> > 2. Subcolumnarization
> > >>>>>>> > Subcolumnarization of variant columns allows query engines to
> > >>>>>>> efficiently prune datasets when subcolumns (i.e., nested fields)
> > >>>>>>> within a variant column are queried, and also allows optionally
> > >>>>>>> materializing some of the nested fields as columns of their own,
> > >>>>>>> affording queries on these subcolumns the ability to read less data
> > >>>>>>> and spend less CPU on extraction. When subcolumnarizing, the system
> > >>>>>>> managing table metadata and data tracks individual pruning statistics
> > >>>>>>> (min, max, null, etc.) for some subset of the nested fields within a
> > >>>>>>> variant, and also manages any optional materialization. Without
> > >>>>>>> subcolumnarization, any query which touches a variant column must
> > >>>>>>> read, parse, extract, and filter every row for which that column is
> > >>>>>>> non-null. Thus, by providing a standardized way of tracking subcolumn
> > >>>>>>> metadata and data for variant columns, Iceberg can make subcolumnar
> > >>>>>>> optimizations accessible across various catalogs and query engines.
> > >>>>>>> >
> > >>>>>>> > Subcolumnarization is a non-trivial topic, so we expect any
> > >>>>>>> concrete proposal to include not only the set of changes to Iceberg
> > >>>>>>> metadata that allow compatible query engines to interoperate on
> > >>>>>>> subcolumnarization data for variant columns, but also reference
> > >>>>>>> documentation explaining subcolumnarization principles and recommended
> > >>>>>>> best practices.
> > >>>>>>> >
> > >>>>>>> > It sounds like the recent Geo proposal [3] may be a good starting
> > >>>>>>> point for how to approach this, so our plan is to write something up
> > >>>>>>> in that vein that covers the proposed spec changes, backwards
> > >>>>>>> compatibility, implementor burdens, etc.
> > >>>>>>> But we wanted to first reach out to the community to introduce
> > >>>>>>> ourselves and the idea, and see if there's any early feedback we
> > >>>>>>> should incorporate before we spend too much time on a concrete
> > >>>>>>> proposal.
> > >>>>>>> >
> > >>>>>>> > Thank you!
> > >>>>>>> >
> > >>>>>>> > [1] https://docs.snowflake.com/en/sql-reference/data-types-semistructured
> > >>>>>>> > [2] https://github.com/apache/spark/blob/master/common/variant/README.md
> > >>>>>>> > [3] https://docs.google.com/document/d/1iVFbrRNEzZl8tDcZC81GFt01QJkLJsI9E2NBOt21IRI/edit
> > >>>>>>> >
> > >>>>>>> > -Tyler, Nileema, Selcuk, Aihua

--
Ryan Blue
Databricks
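For readers less familiar with the data-skipping benefit of subcolumnarization discussed in this thread, the pruning idea can be sketched roughly as follows. The stats layout, file names, and paths here are hypothetical, not the proposed spec: the table metadata tracks min/max statistics per file for a chosen subset of nested paths inside a variant column, and a scan planner drops files whose value range cannot match the predicate.

```python
# Hypothetical per-file min/max stats for subcolumnarized variant paths.
file_stats = [
    {"file": "a.parquet", "stats": {"event.ts": (100, 199), "event.level": (1, 3)}},
    {"file": "b.parquet", "stats": {"event.ts": (200, 299), "event.level": (2, 5)}},
    {"file": "c.parquet", "stats": {}},  # path not tracked: cannot prune this file
]

def files_to_scan(path, lo, hi):
    """Keep files whose [min, max] range for `path` may overlap [lo, hi].
    Files with no stats for the path must be scanned (no pruning possible)."""
    keep = []
    for f in file_stats:
        rng = f["stats"].get(path)
        if rng is None or (rng[0] <= hi and lo <= rng[1]):
            keep.append(f["file"])
    return keep

print(files_to_scan("event.ts", 250, 400))  # ['b.parquet', 'c.parquet']
```

Without such per-path stats, a predicate on `event.ts` would require reading and parsing the full variant column in every file, which is the "read, parse, extract, and filter every row" cost described above.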