Oops, it looks like I missed where Aihua brought this up in his last email:
> do we have an issue to directly use Spark implementation in Iceberg?

Yes, I think that we do have an issue using the Spark library. What do you think about a Java implementation in Iceberg?

Ryan

On Fri, Jul 12, 2024 at 12:28 PM Ryan Blue <b...@databricks.com> wrote:

> I raised the same point from Peter's email in a comment on the doc as well. There is a spark-variant_2.13 artifact that would be a much smaller scope than relying on large portions of Spark, but even then I doubt that it is a good idea for Iceberg to depend on that because it is a Scala artifact and we would need to bring in a ton of Scala libs. I think what makes the most sense is to have an independent implementation of the spec in Iceberg.
>
> On Fri, Jul 12, 2024 at 11:51 AM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>
>> Hi Aihua,
>> Long time no see :)
>> Would this mean that every engine which plans to support the Variant data type needs to add Spark as a dependency? Like Flink/Trino/Hive etc.?
>> Thanks, Peter
>>
>> On Fri, Jul 12, 2024, 19:10 Aihua Xu <aihu...@apache.org> wrote:
>>
>>> Thanks Ryan.
>>>
>>> Yeah. That's another reason we want to pursue the Spark encoding: to keep compatibility for the open source engines.
>>>
>>> One more question regarding the encoding implementation: do we have an issue to directly use the Spark implementation in Iceberg? Russell pointed out that Trino doesn't have a Spark dependency and that could be a problem?
>>>
>>> Thanks,
>>> Aihua
>>>
>>> On 2024/07/12 15:02:06 Ryan Blue wrote:
>>> > Thanks, Aihua!
>>> >
>>> > I think that the encoding choice in the current doc is a good one. I went through the Spark encoding in detail and it looks like a better choice than the other candidate encodings for quickly accessing nested fields.
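Ryan's point about quickly accessing nested fields comes from the encoding's split into a sorted key dictionary plus a value tree that refers to keys by id. The sketch below is only a rough Python illustration of that general idea, not the actual Spark byte layout (which is a compact binary format described in the Spark variant README); all function names here are invented for the sketch.

```python
from bisect import bisect_left

def collect_keys(obj, acc):
    # Gather every object key in the tree, at any depth.
    if isinstance(obj, dict):
        for k, v in obj.items():
            acc.add(k)
            collect_keys(v, acc)
    elif isinstance(obj, list):
        for v in obj:
            collect_keys(v, acc)

def encode(obj):
    """Split a JSON-like value into (metadata, value): metadata is a sorted
    key dictionary, and the value tree refers to keys by integer id."""
    acc = set()
    collect_keys(obj, acc)
    metadata = sorted(acc)
    ids = {k: i for i, k in enumerate(metadata)}
    def enc(o):
        if isinstance(o, dict):
            # Fields are stored sorted by key id so a reader can binary-search.
            return ("obj", sorted((ids[k], enc(v)) for k, v in o.items()))
        if isinstance(o, list):
            return ("arr", [enc(v) for v in o])
        return ("lit", o)
    return metadata, enc(obj)

def get_field(metadata, value, name):
    """Fetch one top-level field without walking sibling subtrees."""
    i = bisect_left(metadata, name)
    if i == len(metadata) or metadata[i] != name:
        return None  # key not present anywhere in this variant
    tag, fields = value
    if tag != "obj":
        return None
    lo, hi = 0, len(fields)
    while lo < hi:  # binary search on the id-sorted field list
        mid = (lo + hi) // 2
        if fields[mid][0] < i:
            lo = mid + 1
        else:
            hi = mid
    if lo < len(fields) and fields[lo][0] == i:
        return fields[lo][1]
    return None
```

The key property being illustrated: `get_field` finds a field with two binary searches and never descends into sibling subtrees, which is why a dictionary-based encoding beats re-parsing text for nested access.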
>>> >
>>> > Another reason to use the Spark type is that this is what Delta's variant type is based on, so Parquet files in tables written by Delta could be converted or used in Iceberg tables without needing to rewrite variant data. (Also, note that I work at Databricks and have an interest in increasing format compatibility.)
>>> >
>>> > Ryan
>>> >
>>> > On Thu, Jul 11, 2024 at 11:21 AM Aihua Xu <aihua...@snowflake.com.invalid> wrote:
>>> >
>>> > > [Discuss] Consensus for Variant Encoding
>>> > >
>>> > > It’s great to be able to present the Variant type proposal in the community sync yesterday, and I’m looking to host a meeting next week (targeting 9am, July 17th) to go over any further concerns about the encoding of the Variant type and any other questions on the first phase of the proposal <https://docs.google.com/document/d/1QjhpG_SVNPZh3anFcpicMQx90ebwjL7rmzFYfUP89Iw/edit>. We are hoping that anyone who is interested in the proposal can either join or reply with their comments so we can discuss them. A summary of the discussion and notes will be sent to the mailing list for further comment there.
>>> > >
>>> > > - What should be the underlying binary representation?
>>> > >
>>> > > We have evaluated a few encodings in the doc, including ION, JSONB, and the Spark encoding. Choosing the underlying encoding is an important first step here, and we believe we have general support for Spark’s Variant encoding. We would like to hear if anyone else has strong opinions in this space.
>>> > >
>>> > > - Should we support multiple logical types or just Variant? Variant vs. Variant + JSON.
>>> > >
>>> > > This is to discuss what logical data type(s) should be supported in Iceberg - Variant only vs. Variant + JSON.
>>> > > Both types would share the same underlying encoding but would imply different limitations on engines working with those types.
>>> > >
>>> > > From the sync-up meeting, we are leaning toward supporting Variant only, and we want to reach a consensus on the supported type(s).
>>> > >
>>> > > - How should we move forward with subcolumnarization?
>>> > >
>>> > > Subcolumnarization is an optimization for the Variant type that separates out subcolumns with their own metadata. This is not critical for choosing the initial encoding of the Variant type, so we were hoping to gain consensus on leaving that for a follow-up spec.
>>> > >
>>> > > Thanks,
>>> > >
>>> > > Aihua
>>> > >
>>> > > Meeting invite:
>>> > >
>>> > > Wednesday, July 17 · 9:00 – 10:00am
>>> > > Time zone: America/Los_Angeles
>>> > > Google Meet joining info
>>> > > Video call link: https://meet.google.com/pbm-ovzn-aoq
>>> > > Or dial: (US) +1 650-449-9343 PIN: 170 576 525#
>>> > > More phone numbers: https://tel.meet/pbm-ovzn-aoq?pin=4079632691790
>>> > >
>>> > > On Tue, May 28, 2024 at 9:21 PM Aihua Xu <aihua...@snowflake.com> wrote:
>>> > >
>>> > >> Hello,
>>> > >>
>>> > >> We have drafted the proposal <https://docs.google.com/document/d/1QjhpG_SVNPZh3anFcpicMQx90ebwjL7rmzFYfUP89Iw/edit> for the Variant data type. Please help review and comment.
>>> > >>
>>> > >> Thanks,
>>> > >> Aihua
>>> > >>
>>> > >> On Thu, May 16, 2024 at 12:45 PM Jack Ye <yezhao...@gmail.com> wrote:
>>> > >>
>>> > >>> +10000 for a JSON/BSON type. We also had the same discussion internally, and a JSON type would really play well with, for example, the SUPER type in Redshift: https://docs.aws.amazon.com/redshift/latest/dg/r_SUPER_type.html, and can also provide better integration with the Trino JSON type.
>>> > >>>
>>> > >>> Looking forward to the proposal!
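The per-subcolumn pruning statistics mentioned in the subcolumnarization bullet can be pictured with a small Python sketch. The stat names and layout below are assumptions made for illustration only, not the metadata format the proposal would define.

```python
def subcolumn_stats(rows, path):
    """Collect min/max/null-count for one nested field ("subcolumn") across
    the rows of a data file. Illustrative only: the real proposal would
    define the actual statistics shape."""
    values, nulls = [], 0
    for row in rows:
        v = row
        for part in path.split("."):
            v = v.get(part) if isinstance(v, dict) else None
            if v is None:
                break
        if v is None:
            nulls += 1
        else:
            values.append(v)
    return {
        "nulls": nulls,
        "min": min(values) if values else None,
        "max": max(values) if values else None,
    }

def can_skip_file(stats, lower_bound):
    # A query filtering `subcolumn > lower_bound` can skip the whole file
    # when every recorded value falls at or below the bound.
    return stats["max"] is not None and stats["max"] <= lower_bound
```

With stats like these tracked per data file, an engine can prune files for a predicate on a nested field without opening and parsing any variant values, which is the core of the optimization being discussed.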
>>> > >>>
>>> > >>> Best,
>>> > >>> Jack Ye
>>> > >>>
>>> > >>> On Wed, May 15, 2024 at 9:37 AM Tyler Akidau <tyler.aki...@snowflake.com.invalid> wrote:
>>> > >>>
>>> > >>>> On Tue, May 14, 2024 at 7:58 PM Gang Wu <ust...@gmail.com> wrote:
>>> > >>>>
>>> > >>>>> > We may need some guidance on just how many we need to look at;
>>> > >>>>> > we were planning on Spark and Trino, but weren't sure how much
>>> > >>>>> > further down the rabbit hole we needed to go.
>>> > >>>>>
>>> > >>>>> There are some engines living outside the Java world. It would be good if the proposal could cover the effort it takes to integrate the variant type into them (e.g. Velox, DataFusion, etc.). This is something that some proprietary Iceberg vendors also care about.
>>> > >>>>>
>>> > >>>> Ack, makes sense. We can make sure to share some perspective on this.
>>> > >>>>
>>> > >>>>> > Not necessarily, no. As long as there's a binary type and Iceberg and
>>> > >>>>> > the query engines are aware that the binary column needs to be
>>> > >>>>> > interpreted as a variant, that should be sufficient.
>>> > >>>>>
>>> > >>>>> From the perspective of interoperability, it would be good to support a native type in the file specs. Life will be easier for projects like Apache XTable. The file format could also provide finer-grained statistics for the variant type, which facilitates data skipping.
>>> > >>>>>
>>> > >>>> Agreed, there can definitely be additional value in native file format integration. Just wanted to highlight that it's not a strict requirement.
>>> > >>>>
>>> > >>>> -Tyler
>>> > >>>>
>>> > >>>>> Gang
>>> > >>>>>
>>> > >>>>> On Wed, May 15, 2024 at 6:49 AM Tyler Akidau <tyler.aki...@snowflake.com.invalid> wrote:
>>> > >>>>>
>>> > >>>>>> Good to see you again as well, JB!
>>> > >>>>>> Thanks!
>>> > >>>>>>
>>> > >>>>>> -Tyler
>>> > >>>>>>
>>> > >>>>>> On Tue, May 14, 2024 at 1:04 PM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>>> > >>>>>>
>>> > >>>>>>> Hi Tyler,
>>> > >>>>>>>
>>> > >>>>>>> Super happy to see you there :) It reminds me of our discussions back at the start of Apache Beam :)
>>> > >>>>>>>
>>> > >>>>>>> Anyway, the thread is pretty interesting. I remember some discussions about a JSON data type for spec v3. The binary data type is already supported in spec v2.
>>> > >>>>>>>
>>> > >>>>>>> I'm looking forward to the proposal and happy to help on this!
>>> > >>>>>>>
>>> > >>>>>>> Regards,
>>> > >>>>>>> JB
>>> > >>>>>>>
>>> > >>>>>>> On Sat, May 11, 2024 at 7:06 AM Tyler Akidau <tyler.aki...@snowflake.com.invalid> wrote:
>>> > >>>>>>> >
>>> > >>>>>>> > Hello,
>>> > >>>>>>> >
>>> > >>>>>>> > We (Tyler, Nileema, Selcuk, Aihua) are working on a proposal for which we’d like to get early feedback from the community. As you may know, Snowflake has embraced Iceberg as its open Data Lake format. Having made good progress on our own adoption of the Iceberg standard, we’re now in a position where there are features not yet supported in Iceberg which we think would be valuable for our users, and that we would like to discuss with and help contribute to the Iceberg community.
>>> > >>>>>>> >
>>> > >>>>>>> > The first two such features we’d like to discuss are in support of efficient querying of dynamically typed, semi-structured data: variant data types, and subcolumnarization of variant columns. In more detail, for anyone who may not already be familiar:
>>> > >>>>>>> >
>>> > >>>>>>> > 1.
>>> > >>>>>>> > Variant data types
>>> > >>>>>>> > Variant types allow for the efficient binary encoding of dynamic semi-structured data such as JSON, Avro, etc. By encoding semi-structured data as a variant column, we retain the flexibility of the source data, while allowing query engines to more efficiently operate on the data. Snowflake has supported the variant data type on Snowflake tables for many years [1]. As more and more users utilize Iceberg tables in Snowflake, we’re hearing an increasing chorus of requests for variant support. Additionally, other query engines such as Apache Spark have begun adding variant support [2]. As such, we believe it would be beneficial to the Iceberg community as a whole to standardize on the variant data type encoding used across Iceberg tables.
>>> > >>>>>>> >
>>> > >>>>>>> > One specific point to make here is that, since an Apache OSS version of variant encoding already exists in Spark, it likely makes sense to simply adopt the Spark encoding as the Iceberg standard as well. The encoding we use internally today in Snowflake is slightly different, but essentially equivalent, and we see no particular value in trying to clutter the space with another equivalent-but-incompatible encoding.
>>> > >>>>>>> >
>>> > >>>>>>> > 2.
>>> > >>>>>>> > Subcolumnarization
>>> > >>>>>>> > Subcolumnarization of variant columns allows query engines to efficiently prune datasets when subcolumns (i.e., nested fields) within a variant column are queried, and also allows optionally materializing some of the nested fields as columns of their own, affording queries on these subcolumns the ability to read less data and spend less CPU on extraction. When subcolumnarizing, the system managing table metadata and data tracks individual pruning statistics (min, max, null, etc.) for some subset of the nested fields within a variant, and also manages any optional materialization. Without subcolumnarization, any query which touches a variant column must read, parse, extract, and filter every row for which that column is non-null. Thus, by providing a standardized way of tracking subcolumn metadata and data for variant columns, Iceberg can make subcolumnar optimizations accessible across various catalogs and query engines.
>>> > >>>>>>> >
>>> > >>>>>>> > Subcolumnarization is a non-trivial topic, so we expect any concrete proposal to include not only the set of changes to Iceberg metadata that allow compatible query engines to interoperate on subcolumnarization data for variant columns, but also reference documentation explaining subcolumnarization principles and recommended best practices.
>>> > >>>>>>> >
>>> > >>>>>>> > It sounds like the recent Geo proposal [3] may be a good starting point for how to approach this, so our plan is to write something up in that vein that covers the proposed spec changes, backwards compatibility, implementor burdens, etc.
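The optional materialization described above can be sketched briefly: selected nested fields are extracted once at write time into standalone columns, so queries on them avoid reading and parsing the variant at all. This is a hypothetical Python illustration; the function and its behavior are not part of the proposal.

```python
def materialize_subcolumns(rows, paths):
    """Extract chosen nested fields ("a.b" style dotted paths) into
    standalone columns at write time. Hypothetical sketch only; the
    proposal would define the real materialization mechanism."""
    columns = {p: [] for p in paths}
    for row in rows:
        for p in paths:
            v = row
            for part in p.split("."):
                v = v.get(part) if isinstance(v, dict) else None
                if v is None:
                    break
            # Missing fields become nulls in the materialized column.
            columns[p].append(v)
    return columns
```

Once materialized, a query touching only `user.id` reads that column directly, which is the "read less data, spend less CPU on extraction" benefit described in the paragraph above.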
>>> > >>>>>>> > But we wanted to first reach out to the community to introduce ourselves and the idea, and see if there’s any early feedback we should incorporate before we spend too much time on a concrete proposal.
>>> > >>>>>>> >
>>> > >>>>>>> > Thank you!
>>> > >>>>>>> >
>>> > >>>>>>> > [1] https://docs.snowflake.com/en/sql-reference/data-types-semistructured
>>> > >>>>>>> > [2] https://github.com/apache/spark/blob/master/common/variant/README.md
>>> > >>>>>>> > [3] https://docs.google.com/document/d/1iVFbrRNEzZl8tDcZC81GFt01QJkLJsI9E2NBOt21IRI/edit
>>> > >>>>>>> >
>>> > >>>>>>> > -Tyler, Nileema, Selcuk, Aihua
>>> >
>>> > --
>>> > Ryan Blue
>>> > Databricks
>
> --
> Ryan Blue
> Databricks

--
Ryan Blue
Databricks