That's fair, I'm sold on an Iceberg module.

On Fri, Jul 12, 2024 at 2:53 PM Ryan Blue <b...@databricks.com.invalid> wrote:
> Feels like eventually the encoding should land in Parquet proper, right?

What about using it in ORC? I don't know where it should end up. Maybe Iceberg should make a standalone module from it?

On Fri, Jul 12, 2024 at 12:38 PM Russell Spitzer <russell.spit...@gmail.com> wrote:

Feels like eventually the encoding should land in Parquet proper, right? I'm fine with us just copying it into Iceberg for the time being.

On Fri, Jul 12, 2024 at 2:31 PM Ryan Blue <b...@databricks.com.invalid> wrote:

Oops, it looks like I missed where Aihua brought this up in his last email:

> do we have an issue to directly use Spark implementation in Iceberg?

Yes, I think that we do have an issue using the Spark library. What do you think about a Java implementation in Iceberg?

Ryan

On Fri, Jul 12, 2024 at 12:28 PM Ryan Blue <b...@databricks.com> wrote:

I raised the same point from Peter's email in a comment on the doc as well. There is a spark-variant_2.13 artifact that would be a much smaller scope than relying on large portions of Spark, but even then I doubt it is a good idea for Iceberg to depend on it, because it is a Scala artifact and we would need to bring in a ton of Scala libs. I think what makes the most sense is to have an independent implementation of the spec in Iceberg.

On Fri, Jul 12, 2024 at 11:51 AM Péter Váry <peter.vary.apa...@gmail.com> wrote:

Hi Aihua,
Long time no see :)
Would this mean that every engine which plans to support the Variant data type needs to add Spark as a dependency? Like Flink/Trino/Hive, etc.?
Thanks, Peter

On Fri, Jul 12, 2024, 19:10 Aihua Xu <aihu...@apache.org> wrote:

Thanks Ryan.

Yeah, that's another reason we want to pursue the Spark encoding: to keep compatibility across the open source engines.

One more question regarding the encoding implementation: do we have an issue with directly using the Spark implementation in Iceberg? Russell pointed out that Trino doesn't have a Spark dependency, and that could be a problem.

Thanks,
Aihua

On 2024/07/12 15:02:06 Ryan Blue wrote:

Thanks, Aihua!

I think that the encoding choice in the current doc is a good one. I went through the Spark encoding in detail and it looks like a better choice than the other candidate encodings for quickly accessing nested fields.

Another reason to use the Spark type is that this is what Delta's variant type is based on, so Parquet files in tables written by Delta could be converted or used in Iceberg tables without needing to rewrite variant data. (Also, note that I work at Databricks and have an interest in increasing format compatibility.)

Ryan
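For context on why the Spark layout is quick for nested field access: per the Spark variant README (linked as [2] later in the thread), field names are stored once in a metadata dictionary and an object's fields are ordered by key name, so each path step can be a binary search instead of a full parse. Below is a minimal Java sketch of that lookup under those assumptions; VariantObject and getFieldByKey are hypothetical names for illustration, not the actual spark-variant API, and the byte-level layout is elided.

    // Illustrative model of variant object lookup. Assumes fieldIds is sorted
    // by the key strings it references, as the Spark variant spec requires.
    final class VariantObject {
        private final String[] dictionary;   // metadata: de-duplicated key strings
        private final int[] fieldIds;        // per-field dictionary ids, sorted by key
        private final Object[] fieldValues;  // decoded field values, parallel to fieldIds

        VariantObject(String[] dictionary, int[] fieldIds, Object[] fieldValues) {
            this.dictionary = dictionary;
            this.fieldIds = fieldIds;
            this.fieldValues = fieldValues;
        }

        // Binary-search the sorted field ids for `key`: O(log n) per path step.
        Object getFieldByKey(String key) {
            int lo = 0, hi = fieldIds.length - 1;
            while (lo <= hi) {
                int mid = (lo + hi) >>> 1;
                int cmp = dictionary[fieldIds[mid]].compareTo(key);
                if (cmp == 0) return fieldValues[mid];
                if (cmp < 0) lo = mid + 1; else hi = mid - 1;
            }
            return null; // absent key; variant fields are dynamically typed
        }

        public static void main(String[] args) {
            String[] dict = {"city", "name", "user"};
            // Object with fields "name" and "user", sorted by key name.
            VariantObject obj = new VariantObject(dict, new int[] {1, 2},
                new Object[] {"a", "..."});
            System.out.println(obj.getFieldByKey("name")); // prints: a
        }
    }

A chained path like $.user.address.city then costs one such search per level, without deserializing the rest of the document.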
On Thu, Jul 11, 2024 at 11:21 AM Aihua Xu <aihua...@snowflake.com.invalid> wrote:

[Discuss] Consensus for Variant Encoding

It was great to be able to present the Variant type proposal in the community sync yesterday, and I'm looking to host a meeting next week (targeting 9am, July 17th) to go over any further concerns about the encoding of the Variant type and any other questions on the first phase of the proposal <https://docs.google.com/document/d/1QjhpG_SVNPZh3anFcpicMQx90ebwjL7rmzFYfUP89Iw/edit>. We are hoping that anyone who is interested in the proposal can either join or reply with their comments so we can discuss them. A summary of the discussion and notes will be sent to the mailing list for further comment there.

- What should be the underlying binary representation?

We have evaluated a few encodings in the doc, including Ion, JSONB, and the Spark encoding. Choosing the underlying encoding is an important first step, and we believe we have general support for Spark's Variant encoding. We would like to hear if anyone else has strong opinions in this space.

- Should we support multiple logical types or just Variant? Variant vs. Variant + JSON.

This is to discuss which logical data type(s) should be supported in Iceberg: Variant only, or Variant + JSON. Both types would share the same underlying encoding but would imply different limitations on engines working with those types (a sketch of the distinction follows below).

From the sync-up meeting, we are leaning toward supporting Variant only, and we want to reach consensus on the supported type(s).

- How should we move forward with subcolumnarization?

Subcolumnarization is an optimization for the Variant type that separates out subcolumns with their own metadata. It is not critical for choosing the initial encoding of the Variant type, so we were hoping to gain consensus on leaving it for a follow-up spec.

Thanks

Aihua

Meeting invite:

Wednesday, July 17 · 9:00 – 10:00am
Time zone: America/Los_Angeles
Google Meet joining info
Video call link: https://meet.google.com/pbm-ovzn-aoq
Or dial: (US) +1 650-449-9343 PIN: 170 576 525#
More phone numbers: https://tel.meet/pbm-ovzn-aoq?pin=4079632691790
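On the "Variant vs. Variant + JSON" question, one way to see the trade-off: both logical types would sit on the same bytes, so a JSON type is largely a rendering (and round-trip) contract over the decoded value tree. A hedged Java sketch of that rendering, using a plain Map/List model of a decoded variant; this is illustration only, not a proposed API:

    // Renders a decoded variant-like tree as JSON text. Escaping is simplified
    // (quotes only) since this is a sketch of the idea, not a JSON library.
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.stream.Collectors;

    final class VariantToJson {
        static String toJson(Object v) {
            if (v == null) return "null";
            if (v instanceof String s) return "\"" + s.replace("\"", "\\\"") + "\"";
            if (v instanceof Map<?, ?> m) {
                return m.entrySet().stream()
                    .map(e -> "\"" + e.getKey() + "\":" + toJson(e.getValue()))
                    .collect(Collectors.joining(",", "{", "}"));
            }
            if (v instanceof List<?> l) {
                return l.stream().map(VariantToJson::toJson)
                    .collect(Collectors.joining(",", "[", "]"));
            }
            return v.toString(); // numbers and booleans
        }

        public static void main(String[] args) {
            Map<String, Object> m = new LinkedHashMap<>();
            m.put("a", List.of(1, 2));
            m.put("b", "x");
            System.out.println(toJson(m)); // prints: {"a":[1,2],"b":"x"}
        }
    }

A Variant-only type leaves values fully typed, while a JSON type would additionally pin down a textual projection like the one above, which is part of what would constrain engines differently.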
On Tue, May 28, 2024 at 9:21 PM Aihua Xu <aihua...@snowflake.com> wrote:

Hello,

We have drafted the proposal <https://docs.google.com/document/d/1QjhpG_SVNPZh3anFcpicMQx90ebwjL7rmzFYfUP89Iw/edit> for the Variant data type. Please help review and comment.

Thanks,
Aihua

On Thu, May 16, 2024 at 12:45 PM Jack Ye <yezhao...@gmail.com> wrote:

+10000 for a JSON/BSON type. We also had the same discussion internally, and a JSON type would play really well with, for example, the SUPER type in Redshift (https://docs.aws.amazon.com/redshift/latest/dg/r_SUPER_type.html), and could also provide better integration with the Trino JSON type.

Looking forward to the proposal!

Best,
Jack Ye

On Wed, May 15, 2024 at 9:37 AM Tyler Akidau <tyler.aki...@snowflake.com.invalid> wrote:

On Tue, May 14, 2024 at 7:58 PM Gang Wu <ust...@gmail.com> wrote:

>> We may need some guidance on just how many we need to look at;
>> we were planning on Spark and Trino, but weren't sure how much
>> further down the rabbit hole we needed to go.
>
> There are some engines living outside the Java world. It would be good
> if the proposal could cover the effort it takes to integrate the variant
> type into them (e.g. Velox, DataFusion, etc.). This is something that
> some proprietary Iceberg vendors also care about.

Ack, makes sense. We can make sure to share some perspective on this.

>> Not necessarily, no. As long as there's a binary type, and Iceberg and
>> the query engines are aware that the binary column needs to be
>> interpreted as a variant, that should be sufficient.
>
> From the perspective of interoperability, it would be good for the file
> specs to support the type natively. Life will be easier for projects like
> Apache XTable. The file format could also provide finer-grained statistics
> for the variant type, which facilitates data skipping.

Agreed, there can definitely be additional value in native file format integration. Just wanted to highlight that it's not a strict requirement.

-Tyler
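Gang's data-skipping point is worth making concrete: if the file format or table metadata tracks min/max statistics for a variant subfield, a planner can drop whole files before reading them. A small hedged Java sketch of that pruning decision; FileStats, the paths, and the values are made up for illustration:

    // Keep only the files whose [min, max] range for a tracked subfield
    // (say payload.user.age) could contain the predicate value.
    import java.util.List;

    record FileStats(String path, long min, long max) {}

    final class SubcolumnPruning {
        static List<FileStats> filesMatchingEquals(List<FileStats> files, long value) {
            return files.stream()
                .filter(f -> f.min() <= value && value <= f.max())
                .toList();
        }

        public static void main(String[] args) {
            List<FileStats> files = List.of(
                new FileStats("data/00000.parquet", 18, 34),
                new FileStats("data/00001.parquet", 35, 70));
            // A predicate like payload.user.age = 25 prunes the second file.
            System.out.println(filesMatchingEquals(files, 25));
        }
    }

Without such per-subfield statistics, every file containing the variant column has to be read and parsed to evaluate the predicate.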
On Wed, May 15, 2024 at 6:49 AM Tyler Akidau <tyler.aki...@snowflake.com.invalid> wrote:

Good to see you again as well, JB! Thanks!

-Tyler

On Tue, May 14, 2024 at 1:04 PM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:

Hi Tyler,

Super happy to see you there :) It reminds me of our discussions back at the start of Apache Beam :)

Anyway, the thread is pretty interesting. I remember some discussions about a JSON datatype for spec v3. The binary data type is already supported in spec v2.

I'm looking forward to the proposal and happy to help on this!

Regards
JB

On Sat, May 11, 2024 at 7:06 AM Tyler Akidau <tyler.aki...@snowflake.com.invalid> wrote:

Hello,

We (Tyler, Nileema, Selcuk, Aihua) are working on a proposal for which we'd like to get early feedback from the community. As you may know, Snowflake has embraced Iceberg as its open data lake format. Having made good progress on our own adoption of the Iceberg standard, we're now in a position where there are features not yet supported in Iceberg which we think would be valuable for our users, and that we would like to discuss with and help contribute to the Iceberg community.

The first two such features we'd like to discuss are in support of efficient querying of dynamically typed, semi-structured data: variant data types, and subcolumnarization of variant columns. In more detail, for anyone who may not already be familiar:

1. Variant data types

Variant types allow for the efficient binary encoding of dynamic semi-structured data such as JSON, Avro, etc. By encoding semi-structured data as a variant column, we retain the flexibility of the source data while allowing query engines to operate on the data more efficiently. Snowflake has supported the variant data type on Snowflake tables for many years [1]. As more and more users utilize Iceberg tables in Snowflake, we're hearing an increasing chorus of requests for variant support. Additionally, other query engines such as Apache Spark have begun adding variant support [2]. As such, we believe it would be beneficial to the Iceberg community as a whole to standardize on the variant data type encoding used across Iceberg tables.

One specific point to make here is that, since an Apache OSS version of the variant encoding already exists in Spark, it likely makes sense to simply adopt the Spark encoding as the Iceberg standard as well. The encoding we use internally today in Snowflake is slightly different, but essentially equivalent, and we see no particular value in trying to clutter the space with another equivalent-but-incompatible encoding.

2. Subcolumnarization

Subcolumnarization of variant columns allows query engines to efficiently prune datasets when subcolumns (i.e., nested fields) within a variant column are queried, and also allows optionally materializing some of the nested fields as columns of their own, affording queries on these subcolumns the ability to read less data and spend less CPU on extraction. When subcolumnarizing, the system managing table metadata and data tracks individual pruning statistics (min, max, null, etc.) for some subset of the nested fields within a variant, and also manages any optional materialization. Without subcolumnarization, any query which touches a variant column must read, parse, extract, and filter every row for which that column is non-null. Thus, by providing a standardized way of tracking subcolumn metadata and data for variant columns, Iceberg can make subcolumnar optimizations accessible across various catalogs and query engines.
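A hedged sketch of the optional materialization half of this idea: at write time, a chosen nested path is extracted into its own typed column alongside the variant, so queries on that path can avoid touching the variant bytes at all. Everything here (the Map-based variant model, the event.ts path, Row) is hypothetical illustration, not proposed spec:

    // Shreds one nested path out of a Map-modeled variant into a typed column.
    import java.util.Map;
    import java.util.Optional;

    record Row(Map<String, Object> variant, Optional<Long> eventTs) {}

    final class Shredder {
        // Extracts variant[outer][inner] if present and a long; empty otherwise,
        // since any given row may lack the path or hold a different type there.
        static Optional<Long> extractLong(Map<String, Object> variant,
                                          String outer, String inner) {
            Object nested = variant.get(outer);
            if (nested instanceof Map<?, ?> m && m.get(inner) instanceof Long l) {
                return Optional.of(l);
            }
            return Optional.empty();
        }

        public static void main(String[] args) {
            Map<String, Object> v =
                Map.of("event", Map.of("ts", 1720800000L, "kind", "click"));
            Row row = new Row(v, extractLong(v, "event", "ts"));
            System.out.println(row.eventTs()); // prints: Optional[1720800000]
        }
    }

Readers that need event.ts would scan the shredded column and its statistics; everything else falls back to the full variant value.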
Subcolumnarization is a non-trivial topic, so we expect any concrete proposal to include not only the set of changes to Iceberg metadata that allow compatible query engines to interoperate on subcolumnarization data for variant columns, but also reference documentation explaining subcolumnarization principles and recommended best practices.

It sounds like the recent Geo proposal [3] may be a good starting point for how to approach this, so our plan is to write something up in that vein that covers the proposed spec changes, backwards compatibility, implementor burdens, etc. But we wanted to first reach out to the community to introduce ourselves and the idea, and see if there's any early feedback we should incorporate before we spend too much time on a concrete proposal.

Thank you!

[1] https://docs.snowflake.com/en/sql-reference/data-types-semistructured
[2] https://github.com/apache/spark/blob/master/common/variant/README.md
[3] https://docs.google.com/document/d/1iVFbrRNEzZl8tDcZC81GFt01QJkLJsI9E2NBOt21IRI/edit

-Tyler, Nileema, Selcuk, Aihua