Hi community,

Thanks for joining the meeting to discuss variant shredding. For those who
were unable to attend, the recording is available here if you are interested:
<https://drive.google.com/file/d/1kiwv29nxxOqMCbxXn-NRoz-x2E9yIMlJ/view?usp=drive_link>

Also, to follow up on the meeting and converge on the shredding lossiness
discussion offline, I have converted the Spark shredding proposal by David
into a Google doc; please comment there:
<https://docs.google.com/document/d/1JeBt4NIju08jQ2AbludiK-U0M9ISIgysP7fUDWtv7rg/edit>

Thanks,
Aihua


On Thu, Jul 25, 2024 at 10:14 AM Aihua Xu <aihu...@gmail.com> wrote:

> Yes. This time I was able to record it and I will share it when it’s
> processed.
>
>
> On Jul 25, 2024, at 10:01 AM, Amogh Jahagirdar <2am...@gmail.com> wrote:
>
> 
> Any chance this meeting was recorded? I couldn't make it but would be
> interested in catching up on the discussion.
>
> Thanks,
>
> Amogh Jahagirdar
>
> On Tue, Jul 23, 2024 at 11:30 AM Aihua Xu <aihu...@gmail.com> wrote:
>
>> Thanks, folks, for the additional discussion.
>>
>> There are some questions related to subcolumnization (Spark shredding;
>> see the discussion <https://github.com/apache/spark/pull/46831>), and we
>> would like to host another meeting mainly to discuss that, since we plan to
>> adopt it. We can also follow up on the Spark variant topics (I can see that
>> we are mostly aligned, except for finding a home for the spec and
>> implementation). Looking forward to meeting with you. BTW: should I include
>> dev@iceberg.apache.org in the email invite?
>>
>> Sync up on Variant subcolumnization (shredding)
>> Thursday, July 25 · 8:00 – 9:00am
>> Time zone: America/Los_Angeles
>> Google Meet joining info
>> Video call link: https://meet.google.com/mug-dvnv-hnq
>> Or dial: ‪(US) +1 904-900-0730‬ PIN: ‪671 997 419‬#
>> More phone numbers: https://tel.meet/mug-dvnv-hnq?pin=1654043233422
>>
>> Thanks,
>> Aihua
>>
>> On Tue, Jul 23, 2024 at 6:36 AM Amogh Jahagirdar <2am...@gmail.com>
>> wrote:
>>
>>> I'm late replying to this, but I'm also in agreement with 1 (adopting the
>>> Spark variant encoding), 3 (specifically only having a Variant type), and 4
>>> (ensuring we think through subcolumnarization upfront, since without
>>> it the Variant type may not be that useful).
>>>
>>> I'd also support having the spec and reference implementation in
>>> Iceberg; as others have said, it centralizes improvements in a single,
>>> agnostic dependency for engines, rather than engines having to take
>>> dependencies on other engine modules.
>>>
>>> Thanks,
>>>
>>> Amogh Jahagirdar
>>>
>>> On Tue, Jul 23, 2024 at 12:15 AM Péter Váry <peter.vary.apa...@gmail.com>
>>> wrote:
>>>
>>>> I have been looking around at how we can map the Variant type in Flink. I
>>>> have not found any existing type which we could use, but Flink already has
>>>> some JSON parsing capabilities [1] for string fields.
>>>>
>>>> So until Flink has native support for something similar to the
>>>> Variant type, I expect that we need to map it to JSON strings in RowData.
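>>>>
>>>> As a concrete strawman, a query over such a Variant-as-JSON-string
>>>> column could use the JSON functions from [1] like this (the table name,
>>>> column name, and datagen setup are made up for illustration; this is
>>>> not an existing integration):
>>>>
>>>> import org.apache.flink.table.api.EnvironmentSettings;
>>>> import org.apache.flink.table.api.TableEnvironment;
>>>>
>>>> public class VariantAsJsonString {
>>>>   public static void main(String[] args) {
>>>>     TableEnvironment env =
>>>>         TableEnvironment.create(EnvironmentSettings.inStreamingMode());
>>>>     // Stand-in source; "payload" is the Variant column surfaced to
>>>>     // Flink as a JSON STRING in RowData.
>>>>     env.executeSql(
>>>>         "CREATE TABLE events (payload STRING) "
>>>>             + "WITH ('connector' = 'datagen', 'number-of-rows' = '5')");
>>>>     // JSON_VALUE extracts a scalar from the JSON text.
>>>>     env.sqlQuery("SELECT JSON_VALUE(payload, '$.user.id') FROM events")
>>>>         .execute()
>>>>         .print();
>>>>   }
>>>> }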
>>>>
>>>> Based on that, here are my preferences:
>>>> 1. I'm ok with adopting the Spark Variant type if we build our own Iceberg
>>>> serializer/deserializer module for it
>>>> 2. I prefer to move the spec to Iceberg, so we own it and can extend it
>>>> if needed. This could be important in the first phase. Later, when it is
>>>> more stable, we might donate it to some other project, like Parquet
>>>> 3. I would prefer to support only a single type, and Variant is the more
>>>> expressive one, but having a standard way to convert between JSON and Variant
>>>> would be useful for Flink users.
>>>> 4. On subcolumnarization: I think Flink will only use this feature as
>>>> much as the Iceberg readers implement it, so I would like to see as much
>>>> as possible of it in the common Iceberg code
>>>>
>>>> Thanks,
>>>> Peter
>>>>
>>>> [1] -
>>>> https://nightlies.apache.org/flink/flink-docs-release-1.19/docs/dev/table/functions/systemfunctions/#json-functions
>>>>
>>>>
>>>> On Tue, Jul 23, 2024, 06:36 Micah Kornfield <emkornfi...@gmail.com>
>>>> wrote:
>>>>
>>>>> Sorry for the late reply.  I agree with the sentiments on 1 and 3 that
>>>>> have already been posted (adopt the Spark encoding, and only have the
>>>>> Variant type).  As mentioned on the doc for 3, I think it would be good to
>>>>> specify how to map scalar types to a JSON representation so there can be
>>>>> consistency between engines that don't support variant.
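>>>>>
>>>>> As a purely illustrative strawman of the scalar-to-JSON mapping I mean
>>>>> (the class and the exact rules, e.g. ISO-8601 for timestamps and base64
>>>>> for binary, are assumptions the spec would need to pin down):
>>>>>
>>>>> import java.time.Instant;
>>>>> import java.util.Base64;
>>>>>
>>>>> final class ScalarToJson {
>>>>>   // Render a decoded variant scalar as JSON text so that engines
>>>>>   // without native variant support all produce the same string.
>>>>>   static String toJson(Object value) {
>>>>>     if (value == null) return "null";
>>>>>     if (value instanceof Instant ts) return "\"" + ts + "\""; // ISO-8601
>>>>>     if (value instanceof byte[] bin)
>>>>>       return "\"" + Base64.getEncoder().encodeToString(bin) + "\"";
>>>>>     if (value instanceof String s) // real escaping needs more care
>>>>>       return "\"" + s.replace("\"", "\\\"") + "\"";
>>>>>     return value.toString(); // numbers and booleans
>>>>>   }
>>>>> }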
>>>>>
>>>>>
>>>>>> Regarding point 2, I also feel Iceberg is more natural to host such a
>>>>>> subproject for variant spec and implementation. But let me reach out to
>>>>>> the Spark community to discuss.
>>>>>
>>>>>
>>>>> The only other place I can think of that might be a good home for the
>>>>> Variant spec is Apache Arrow, as a canonical extension type. There
>>>>> is an issue for this [1]. I think the main question for where this is housed
>>>>> is which types are intended to be supported. I believe Arrow is currently
>>>>> a superset of the Iceberg type system (UUID is supported as a canonical
>>>>> extension type [2]).
>>>>>
>>>>> For point 4, subcolumnarization, I think ideally this belongs in
>>>>> Iceberg (and if Iceberg and Delta Lake can agree on how to do it, that
>>>>> would be great), with consultation with the Parquet/ORC communities to
>>>>> potentially add better native support.
>>>>>
>>>>> Thanks,
>>>>> Micah
>>>>>
>>>>>
>>>>>
>>>>> [1] https://github.com/apache/arrow/issues/42069
>>>>> [2] https://arrow.apache.org/docs/format/CanonicalExtensions.html
>>>>>
>>>>> On Sat, Jul 20, 2024 at 5:54 PM Aihua Xu <aihu...@gmail.com> wrote:
>>>>>
>>>>>> Thanks for the discussion and feedback.
>>>>>>
>>>>>> Do we have consensus on points 1 and 3 to move forward with the
>>>>>> Spark variant encoding and support the Variant type only? If not, let me
>>>>>> know how to proceed from here.
>>>>>>
>>>>>> Regarding point 2, I also feel Iceberg is more natural to host such a
>>>>>> subproject for variant spec and implementation. But let me reach out to
>>>>>> the Spark community to discuss.
>>>>>>
>>>>>> Thanks,
>>>>>> Aihua
>>>>>>
>>>>>>
>>>>>> On Fri, Jul 19, 2024 at 9:35 AM Yufei Gu <flyrain...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Agreed with point 1.
>>>>>>>
>>>>>>> For point 2, I also prefer to hold the spec and reference
>>>>>>> implementation under Iceberg. Here are the reasons:
>>>>>>> 1. It is unconventional and impractical for one engine to depend on
>>>>>>> another for data types. For instance, it is not ideal for Trino to rely 
>>>>>>> on
>>>>>>> data types defined by the Spark engine.
>>>>>>> 2. Iceberg serves as a bridge between engines and file formats. By
>>>>>>> centralizing the specification in Iceberg, any future optimizations or
>>>>>>> updates to file formats can be referred to within Iceberg, ensuring
>>>>>>> consistency and reducing dependencies.
>>>>>>>
>>>>>>> For point 3, I'd prefer to support the variant type only at this
>>>>>>> moment.
>>>>>>>
>>>>>>> Yufei
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Jul 18, 2024 at 12:55 PM Ryan Blue
>>>>>>> <b...@databricks.com.invalid> wrote:
>>>>>>>
>>>>>>>> Similarly, I'm aligned with point 1 and I'd choose to support only
>>>>>>>> variant for point 3.
>>>>>>>>
>>>>>>>> We'll need to work with the Spark community to find a good place
>>>>>>>> for the library and spec, since it touches many different projects. I'd
>>>>>>>> also prefer Iceberg as the home.
>>>>>>>>
>>>>>>>> I also think it's a good idea to get subcolumnarization into our
>>>>>>>> spec when we update. Without that I think the feature will be fairly
>>>>>>>> limited.
>>>>>>>>
>>>>>>>> On Thu, Jul 18, 2024 at 10:56 AM Russell Spitzer <
>>>>>>>> russell.spit...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> I'm aligned with point 1.
>>>>>>>>>
>>>>>>>>> For point 2, I think we should choose quickly. I honestly do think
>>>>>>>>> this would be fine as part of the Iceberg spec directly, but I
>>>>>>>>> understand it may be better for the broader community if it were a
>>>>>>>>> sub-project. As a sub-project, I would still prefer it being an
>>>>>>>>> Iceberg sub-project, since we are engine/file-format agnostic.
>>>>>>>>>
>>>>>>>>> 3. I support adding just Variant.
>>>>>>>>>
>>>>>>>>> On Thu, Jul 18, 2024 at 12:54 AM Aihua Xu <aihu...@apache.org>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hello community,
>>>>>>>>>>
>>>>>>>>>> It’s great to sync up with some of you on Variant and
>>>>>>>>>> subcolumnization support in Iceberg again. Apologies that I didn’t
>>>>>>>>>> record the meeting, but here are some key items that we want to
>>>>>>>>>> follow up on with the community.
>>>>>>>>>>
>>>>>>>>>> 1. Adopt Spark Variant encoding
>>>>>>>>>> Those present were in favor of adopting the Spark variant
>>>>>>>>>> encoding for the Iceberg Variant type, with extensions to support
>>>>>>>>>> other Iceberg types. We would like to know if anyone objects to
>>>>>>>>>> reusing this open source encoding.
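>>>>>>>>>>
>>>>>>>>>> For anyone who hasn't read the Spark encoding doc yet, a minimal
>>>>>>>>>> sketch of the shape it defines (the class name below is ours, not
>>>>>>>>>> the spec's):
>>>>>>>>>>
>>>>>>>>>> // Per the Spark encoding, a variant value travels as a pair of
>>>>>>>>>> // binaries: "metadata" holds the version plus the dictionary of
>>>>>>>>>> // object keys, and "value" holds the encoded datum, with object
>>>>>>>>>> // fields referencing keys by dictionary id rather than by name.
>>>>>>>>>> public record VariantValue(byte[] metadata, byte[] value) {}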
>>>>>>>>>>
>>>>>>>>>> 2. Movement of the Spark Variant Spec to another project
>>>>>>>>>> To avoid introducing Apache Spark as a dependency for the engines
>>>>>>>>>> and file formats, we discussed separating the Spark Variant encoding
>>>>>>>>>> spec and implementation from the Spark project to a neutral location.
>>>>>>>>>> We thought up several options but didn’t have consensus on any of
>>>>>>>>>> them. We are looking for more feedback on this topic from the
>>>>>>>>>> community, either in terms of support for one of these options or
>>>>>>>>>> another idea on how to support the spec.
>>>>>>>>>>
>>>>>>>>>> Options Proposed:
>>>>>>>>>> * Leave the Spec in Spark (Difficult for versioning and other
>>>>>>>>>> engines)
>>>>>>>>>> * Copying the Spec into Iceberg Project Directly (Difficult for
>>>>>>>>>> other Table Formats)
>>>>>>>>>> * Creating a Sub-Project of Apache Iceberg and moving the spec
>>>>>>>>>> and reference implementation there (Logistically complicated)
>>>>>>>>>> * Creating a Sub-Project of Apache Spark and moving the spec and
>>>>>>>>>> reference implementation there (Logistically complicated)
>>>>>>>>>>
>>>>>>>>>> 3. Add Variant type vs. Variant and JSON types
>>>>>>>>>> Those who were present were in favor of adding only the Variant
>>>>>>>>>> type to Iceberg. We are looking for anyone who has an objection to
>>>>>>>>>> going forward with just the Variant type and no Iceberg JSON type.
>>>>>>>>>> We favored adding the Variant type only because:
>>>>>>>>>> * Introducing a JSON type would require engines that only support
>>>>>>>>>> VARIANT to do write-time validation of their input to a JSON column.
>>>>>>>>>> An engine without a JSON type wouldn’t be able to support this.
>>>>>>>>>> * Engines which don’t support Variant will work most of the time
>>>>>>>>>> and can have fallback strings defined in the spec for reading
>>>>>>>>>> unsupported types. Writing JSON into a Variant will always work.
>>>>>>>>>>
>>>>>>>>>> 4. Support for Subcolumnization spec (shredding in Spark)
>>>>>>>>>> We have no action items on this but would like to follow up on
>>>>>>>>>> discussions on subcolumnization in the future.
>>>>>>>>>> * We had general agreement that this should be included in
>>>>>>>>>> Iceberg V3, or else adding Variant may not be useful.
>>>>>>>>>> * We are interested in also adopting the shredding spec from
>>>>>>>>>> Spark and would like to move it to whatever place we decide the
>>>>>>>>>> Variant spec is going to live.
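>>>>>>>>>>
>>>>>>>>>> To give a rough feel for the kind of per-subcolumn metadata this
>>>>>>>>>> would involve, here is an illustrative sketch (the names are ours,
>>>>>>>>>> not from the Spark shredding spec):
>>>>>>>>>>
>>>>>>>>>> import java.nio.ByteBuffer;
>>>>>>>>>>
>>>>>>>>>> // Pruning statistics tracked for one nested field inside a variant;
>>>>>>>>>> // readers could skip files or row groups when a predicate on the
>>>>>>>>>> // path falls outside [lowerBound, upperBound].
>>>>>>>>>> record SubcolumnStats(
>>>>>>>>>>     String path, // e.g. "$.user.id"
>>>>>>>>>>     ByteBuffer lowerBound,
>>>>>>>>>>     ByteBuffer upperBound,
>>>>>>>>>>     long nullCount) {}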
>>>>>>>>>>
>>>>>>>>>> Let us know if we missed anything and if you have any additional
>>>>>>>>>> thoughts or suggestions.
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>> Aihua
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 2024/07/15 18:32:22 Aihua Xu wrote:
>>>>>>>>>> > Thanks for the discussion.
>>>>>>>>>> >
>>>>>>>>>> > I will move forward to work on spec PR.
>>>>>>>>>> >
>>>>>>>>>> > Regarding the implementation, we will have a module for Variant
>>>>>>>>>> support in Iceberg, so we will not have to bring in Spark libraries.
>>>>>>>>>> >
>>>>>>>>>> > I’m reposting the meeting invite in case it’s not clear in my
>>>>>>>>>> original email, since I included it at the end. It looks like we
>>>>>>>>>> don’t have major objections/divergences, but let’s sync up and reach
>>>>>>>>>> consensus.
>>>>>>>>>> >
>>>>>>>>>> > Meeting invite:
>>>>>>>>>> >
>>>>>>>>>> > Wednesday, July 17 · 9:00 – 10:00am
>>>>>>>>>> > Time zone: America/Los_Angeles
>>>>>>>>>> > Google Meet joining info
>>>>>>>>>> > Video call link: https://meet.google.com/pbm-ovzn-aoq
>>>>>>>>>> > Or dial: ‪(US) +1 650-449-9343‬ PIN: ‪170 576 525‬#
>>>>>>>>>> > More phone numbers:
>>>>>>>>>> https://tel.meet/pbm-ovzn-aoq?pin=4079632691790
>>>>>>>>>> >
>>>>>>>>>> > Thanks,
>>>>>>>>>> > Aihua
>>>>>>>>>> >
>>>>>>>>>> > On 2024/07/12 20:55:01 Micah Kornfield wrote:
>>>>>>>>>> > > I don't think this needs to hold up the PR, but I think coming
>>>>>>>>>> > > to a consensus on the exact set of supported types is worthwhile
>>>>>>>>>> > > (including whether the goal is to maintain the same set as
>>>>>>>>>> > > specified by the Spark Variant type, or whether divergence is
>>>>>>>>>> > > expected/allowed). From a fragmentation perspective it would be
>>>>>>>>>> > > a shame if they diverged, so maybe a next step is also suggesting
>>>>>>>>>> > > to the Spark community that they support the missing existing
>>>>>>>>>> > > Iceberg types?
>>>>>>>>>> > >
>>>>>>>>>> > > Thanks,
>>>>>>>>>> > > Micah
>>>>>>>>>> > >
>>>>>>>>>> > > On Fri, Jul 12, 2024 at 1:44 PM Russell Spitzer <
>>>>>>>>>> russell.spit...@gmail.com>
>>>>>>>>>> > > wrote:
>>>>>>>>>> > >
>>>>>>>>>> > > > Just talked with Aihua and he's working on the Spec PR now.
>>>>>>>>>> We can get
>>>>>>>>>> > > > feedback there from everyone.
>>>>>>>>>> > > >
>>>>>>>>>> > > > On Fri, Jul 12, 2024 at 3:41 PM Ryan Blue
>>>>>>>>>> <b...@databricks.com.invalid>
>>>>>>>>>> > > > wrote:
>>>>>>>>>> > > >
>>>>>>>>>> > > >> Good idea, but I'm hoping that we can continue to get
>>>>>>>>>> their feedback in
>>>>>>>>>> > > >> parallel to getting the spec changes started. Piotr didn't
>>>>>>>>>> seem to object
>>>>>>>>>> > > >> to the encoding from what I read of his comments.
>>>>>>>>>> Hopefully he (and others)
>>>>>>>>>> > > >> chime in here.
>>>>>>>>>> > > >>
>>>>>>>>>> > > >> On Fri, Jul 12, 2024 at 1:32 PM Russell Spitzer <
>>>>>>>>>> > > >> russell.spit...@gmail.com> wrote:
>>>>>>>>>> > > >>
>>>>>>>>>> > > >>> I just want to make sure we get Piotr and Peter on board
>>>>>>>>>> as
>>>>>>>>>> > > >>> representatives of Flink and Trino engines. Also make
>>>>>>>>>> sure we have anyone
>>>>>>>>>> > > >>> else chime in who has experience with Ray if possible.
>>>>>>>>>> > > >>>
>>>>>>>>>> > > >>> Spec changes feel like the right next step.
>>>>>>>>>> > > >>>
>>>>>>>>>> > > >>> On Fri, Jul 12, 2024 at 3:14 PM Ryan Blue
>>>>>>>>>> <b...@databricks.com.invalid>
>>>>>>>>>> > > >>> wrote:
>>>>>>>>>> > > >>>
>>>>>>>>>> > > >>>> Okay, what are the next steps here? This proposal has
>>>>>>>>>> been out for
>>>>>>>>>> > > >>>> quite a while and I don't see any major objections to
>>>>>>>>>> using the Spark
>>>>>>>>>> > > >>>> encoding. It's quite well designed and fits the need
>>>>>>>>>> well. It can also be
>>>>>>>>>> > > >>>> extended to support additional types that are missing if
>>>>>>>>>> that's a priority.
>>>>>>>>>> > > >>>>
>>>>>>>>>> > > >>>> Should we move forward by starting a draft of the
>>>>>>>>>> changes to the table
>>>>>>>>>> > > >>>> spec? Then we can vote on committing those changes and
>>>>>>>>>> get moving on an
>>>>>>>>>> > > >>>> implementation (or possibly do the implementation in
>>>>>>>>>> parallel).
>>>>>>>>>> > > >>>>
>>>>>>>>>> > > >>>> On Fri, Jul 12, 2024 at 1:08 PM Russell Spitzer <
>>>>>>>>>> > > >>>> russell.spit...@gmail.com> wrote:
>>>>>>>>>> > > >>>>
>>>>>>>>>> > > >>>>> That's fair, I'm sold on an Iceberg Module.
>>>>>>>>>> > > >>>>>
>>>>>>>>>> > > >>>>> On Fri, Jul 12, 2024 at 2:53 PM Ryan Blue
>>>>>>>>>> <b...@databricks.com.invalid>
>>>>>>>>>> > > >>>>> wrote:
>>>>>>>>>> > > >>>>>
>>>>>>>>>> > > >>>>>> > Feels like eventually the encoding should land in
>>>>>>>>>> parquet proper
>>>>>>>>>> > > >>>>>> right?
>>>>>>>>>> > > >>>>>>
>>>>>>>>>> > > >>>>>> What about using it in ORC? I don't know where it
>>>>>>>>>> should end up.
>>>>>>>>>> > > >>>>>> Maybe Iceberg should make a standalone module from it?
>>>>>>>>>> > > >>>>>>
>>>>>>>>>> > > >>>>>> On Fri, Jul 12, 2024 at 12:38 PM Russell Spitzer <
>>>>>>>>>> > > >>>>>> russell.spit...@gmail.com> wrote:
>>>>>>>>>> > > >>>>>>
>>>>>>>>>> > > >>>>>>> Feels like eventually the encoding should land in
>>>>>>>>>> parquet proper
>>>>>>>>>> > > >>>>>>> right? I'm fine with us just copying into Iceberg
>>>>>>>>>> though for the time
>>>>>>>>>> > > >>>>>>> being.
>>>>>>>>>> > > >>>>>>>
>>>>>>>>>> > > >>>>>>> On Fri, Jul 12, 2024 at 2:31 PM Ryan Blue
>>>>>>>>>> > > >>>>>>> <b...@databricks.com.invalid> wrote:
>>>>>>>>>> > > >>>>>>>
>>>>>>>>>> > > >>>>>>>> Oops, it looks like I missed where Aihua brought
>>>>>>>>>> this up in his
>>>>>>>>>> > > >>>>>>>> last email:
>>>>>>>>>> > > >>>>>>>>
>>>>>>>>>> > > >>>>>>>> > do we have an issue with directly using the Spark
>>>>>>>>>> > > >>>>>>>> > implementation in Iceberg?
>>>>>>>>>> > > >>>>>>>>
>>>>>>>>>> > > >>>>>>>> Yes, I think that we do have an issue using the
>>>>>>>>>> Spark library. What
>>>>>>>>>> > > >>>>>>>> do you think about a Java implementation in Iceberg?
>>>>>>>>>> > > >>>>>>>>
>>>>>>>>>> > > >>>>>>>> Ryan
>>>>>>>>>> > > >>>>>>>>
>>>>>>>>>> > > >>>>>>>> On Fri, Jul 12, 2024 at 12:28 PM Ryan Blue <
>>>>>>>>>> b...@databricks.com>
>>>>>>>>>> > > >>>>>>>> wrote:
>>>>>>>>>> > > >>>>>>>>
>>>>>>>>>> > > >>>>>>>>> I raised the same point from Peter's email in a
>>>>>>>>>> comment on the doc
>>>>>>>>>> > > >>>>>>>>> as well. There is a spark-variant_2.13 artifact
>>>>>>>>>> that would be a much
>>>>>>>>>> > > >>>>>>>>> smaller scope than relying on large portions of
>>>>>>>>>> > > >>>>>>>>> Spark, but even then I
>>>>>>>>>> > > >>>>>>>>> doubt that it is a good idea for Iceberg to depend
>>>>>>>>>> on that because it is a
>>>>>>>>>> > > >>>>>>>>> Scala artifact and we would need to bring in a ton
>>>>>>>>>> of Scala libs. I think
>>>>>>>>>> > > >>>>>>>>> what makes the most sense is to have an independent
>>>>>>>>>> implementation of the
>>>>>>>>>> > > >>>>>>>>> spec in Iceberg.
>>>>>>>>>> > > >>>>>>>>>
>>>>>>>>>> > > >>>>>>>>> On Fri, Jul 12, 2024 at 11:51 AM Péter Váry <
>>>>>>>>>> > > >>>>>>>>> peter.vary.apa...@gmail.com> wrote:
>>>>>>>>>> > > >>>>>>>>>
>>>>>>>>>> > > >>>>>>>>>> Hi Aihua,
>>>>>>>>>> > > >>>>>>>>>> Long time no see :)
>>>>>>>>>> > > >>>>>>>>>> Would this mean that every engine which plans to
>>>>>>>>>> support the Variant
>>>>>>>>>> > > >>>>>>>>>> data type needs to add Spark as a dependency? Like
>>>>>>>>>> Flink/Trino/Hive, etc.?
>>>>>>>>>> > > >>>>>>>>>> Thanks, Peter
>>>>>>>>>> > > >>>>>>>>>>
>>>>>>>>>> > > >>>>>>>>>>
>>>>>>>>>> > > >>>>>>>>>> On Fri, Jul 12, 2024, 19:10 Aihua Xu <
>>>>>>>>>> aihu...@apache.org> wrote:
>>>>>>>>>> > > >>>>>>>>>>
>>>>>>>>>> > > >>>>>>>>>>> Thanks Ryan.
>>>>>>>>>> > > >>>>>>>>>>>
>>>>>>>>>> > > >>>>>>>>>>> Yeah. That's another reason we want to pursue
>>>>>>>>>> Spark encoding to
>>>>>>>>>> > > >>>>>>>>>>> keep compatibility for the open source engines.
>>>>>>>>>> > > >>>>>>>>>>>
>>>>>>>>>> > > >>>>>>>>>>> One more question regarding the encoding
>>>>>>>>>> > > >>>>>>>>>>> implementation: do we have an issue with directly
>>>>>>>>>> > > >>>>>>>>>>> using the Spark implementation in Iceberg? Russell
>>>>>>>>>> > > >>>>>>>>>>> pointed out that Trino doesn't have a Spark
>>>>>>>>>> > > >>>>>>>>>>> dependency, and that could be a problem.
>>>>>>>>>> > > >>>>>>>>>>>
>>>>>>>>>> > > >>>>>>>>>>> Thanks,
>>>>>>>>>> > > >>>>>>>>>>> Aihua
>>>>>>>>>> > > >>>>>>>>>>>
>>>>>>>>>> > > >>>>>>>>>>> On 2024/07/12 15:02:06 Ryan Blue wrote:
>>>>>>>>>> > > >>>>>>>>>>> > Thanks, Aihua!
>>>>>>>>>> > > >>>>>>>>>>> >
>>>>>>>>>> > > >>>>>>>>>>> > I think that the encoding choice in the current
>>>>>>>>>> doc is a good
>>>>>>>>>> > > >>>>>>>>>>> one. I went
>>>>>>>>>> > > >>>>>>>>>>> > through the Spark encoding in detail and it
>>>>>>>>>> looks like a
>>>>>>>>>> > > >>>>>>>>>>> better choice than
>>>>>>>>>> > > >>>>>>>>>>> > the other candidate encodings for quickly
>>>>>>>>>> accessing nested
>>>>>>>>>> > > >>>>>>>>>>> fields.
>>>>>>>>>> > > >>>>>>>>>>> >
>>>>>>>>>> > > >>>>>>>>>>> > Another reason to use the Spark type is that
>>>>>>>>>> this is what
>>>>>>>>>> > > >>>>>>>>>>> Delta's variant
>>>>>>>>>> > > >>>>>>>>>>> > type is based on, so Parquet files in tables
>>>>>>>>>> written by Delta
>>>>>>>>>> > > >>>>>>>>>>> could be
>>>>>>>>>> > > >>>>>>>>>>> > converted or used in Iceberg tables without
>>>>>>>>>> needing to rewrite
>>>>>>>>>> > > >>>>>>>>>>> variant
>>>>>>>>>> > > >>>>>>>>>>> > data. (Also, note that I work at Databricks and
>>>>>>>>>> have an
>>>>>>>>>> > > >>>>>>>>>>> interest in
>>>>>>>>>> > > >>>>>>>>>>> > increasing format compatibility.)
>>>>>>>>>> > > >>>>>>>>>>> >
>>>>>>>>>> > > >>>>>>>>>>> > Ryan
>>>>>>>>>> > > >>>>>>>>>>> >
>>>>>>>>>> > > >>>>>>>>>>> > On Thu, Jul 11, 2024 at 11:21 AM Aihua Xu <
>>>>>>>>>> > > >>>>>>>>>>> aihua...@snowflake.com.invalid>
>>>>>>>>>> > > >>>>>>>>>>> > wrote:
>>>>>>>>>> > > >>>>>>>>>>> >
>>>>>>>>>> > > >>>>>>>>>>> > > [Discuss] Consensus for Variant Encoding
>>>>>>>>>> > > >>>>>>>>>>> > >
>>>>>>>>>> > > >>>>>>>>>>> > > It was great to be able to present the Variant
>>>>>>>>>> > > >>>>>>>>>>> > > type proposal in the community sync yesterday,
>>>>>>>>>> > > >>>>>>>>>>> > > and I’m looking to host a meeting next week
>>>>>>>>>> > > >>>>>>>>>>> > > (targeting 9am, July 17th) to go over any further
>>>>>>>>>> > > >>>>>>>>>>> > > concerns about the encoding of the Variant type
>>>>>>>>>> > > >>>>>>>>>>> > > and any other questions on the first phase of
>>>>>>>>>> > > >>>>>>>>>>> > > the proposal
>>>>>>>>>> > > >>>>>>>>>>> > > <
>>>>>>>>>> > > >>>>>>>>>>>
>>>>>>>>>> https://docs.google.com/document/d/1QjhpG_SVNPZh3anFcpicMQx90ebwjL7rmzFYfUP89Iw/edit
>>>>>>>>>> > > >>>>>>>>>>> >.
>>>>>>>>>> > > >>>>>>>>>>> > > We are hoping that anyone who is interested in
>>>>>>>>>> > > >>>>>>>>>>> > > the proposal can either join or reply with their
>>>>>>>>>> > > >>>>>>>>>>> > > comments so we can discuss them. A summary of the
>>>>>>>>>> > > >>>>>>>>>>> > > discussion and notes will be sent to the mailing
>>>>>>>>>> > > >>>>>>>>>>> > > list for further comment there.
>>>>>>>>>> > > >>>>>>>>>>> > >
>>>>>>>>>> > > >>>>>>>>>>> > >
>>>>>>>>>> > > >>>>>>>>>>> > >    -
>>>>>>>>>> > > >>>>>>>>>>> > >
>>>>>>>>>> > > >>>>>>>>>>> > >    What should be the underlying binary
>>>>>>>>>> representation
>>>>>>>>>> > > >>>>>>>>>>> > >
>>>>>>>>>> > > >>>>>>>>>>> > > We have evaluated a few encodings in the doc
>>>>>>>>>> including ION,
>>>>>>>>>> > > >>>>>>>>>>> JSONB, and
>>>>>>>>>> > > >>>>>>>>>>> > > Spark encoding. Choosing the underlying
>>>>>>>>>> encoding is an
>>>>>>>>>> > > >>>>>>>>>>> important first step
>>>>>>>>>> > > >>>>>>>>>>> > > here and we believe we have general support
>>>>>>>>>> for Spark’s
>>>>>>>>>> > > >>>>>>>>>>> Variant encoding.
>>>>>>>>>> > > >>>>>>>>>>> > > We would like to hear if anyone else has
>>>>>>>>>> strong opinions in
>>>>>>>>>> > > >>>>>>>>>>> this space.
>>>>>>>>>> > > >>>>>>>>>>> > >
>>>>>>>>>> > > >>>>>>>>>>> > >
>>>>>>>>>> > > >>>>>>>>>>> > >    -
>>>>>>>>>> > > >>>>>>>>>>> > >
>>>>>>>>>> > > >>>>>>>>>>> > >    Should we support multiple logical types
>>>>>>>>>> or just Variant?
>>>>>>>>>> > > >>>>>>>>>>> Variant vs.
>>>>>>>>>> > > >>>>>>>>>>> > >    Variant + JSON.
>>>>>>>>>> > > >>>>>>>>>>> > >
>>>>>>>>>> > > >>>>>>>>>>> > > This is to discuss which logical data type(s)
>>>>>>>>>> > > >>>>>>>>>>> > > should be supported in Iceberg: Variant only vs.
>>>>>>>>>> > > >>>>>>>>>>> > > Variant + JSON. Both types would share the same
>>>>>>>>>> > > >>>>>>>>>>> > > underlying encoding but would imply different
>>>>>>>>>> > > >>>>>>>>>>> > > limitations on engines working with those types.
>>>>>>>>>> > > >>>>>>>>>>> > >
>>>>>>>>>> > > >>>>>>>>>>> > > From the sync-up meeting, we are leaning toward
>>>>>>>>>> > > >>>>>>>>>>> > > supporting Variant only, and we want to have
>>>>>>>>>> > > >>>>>>>>>>> > > consensus on the supported type(s).
>>>>>>>>>> > > >>>>>>>>>>> > >
>>>>>>>>>> > > >>>>>>>>>>> > >
>>>>>>>>>> > > >>>>>>>>>>> > >    -
>>>>>>>>>> > > >>>>>>>>>>> > >
>>>>>>>>>> > > >>>>>>>>>>> > >    How should we move forward with
>>>>>>>>>> Subcolumnization?
>>>>>>>>>> > > >>>>>>>>>>> > >
>>>>>>>>>> > > >>>>>>>>>>> > > Subcolumnization is an optimization for the
>>>>>>>>>> > > >>>>>>>>>>> > > Variant type that separates out subcolumns with
>>>>>>>>>> > > >>>>>>>>>>> > > their own metadata. This is not critical for
>>>>>>>>>> > > >>>>>>>>>>> > > choosing the initial encoding of the Variant
>>>>>>>>>> > > >>>>>>>>>>> > > type, so we were hoping to gain consensus on
>>>>>>>>>> > > >>>>>>>>>>> > > leaving that for a follow-up spec.
>>>>>>>>>> > > >>>>>>>>>>> > >
>>>>>>>>>> > > >>>>>>>>>>> > >
>>>>>>>>>> > > >>>>>>>>>>> > > Thanks
>>>>>>>>>> > > >>>>>>>>>>> > >
>>>>>>>>>> > > >>>>>>>>>>> > > Aihua
>>>>>>>>>> > > >>>>>>>>>>> > >
>>>>>>>>>> > > >>>>>>>>>>> > > Meeting invite:
>>>>>>>>>> > > >>>>>>>>>>> > >
>>>>>>>>>> > > >>>>>>>>>>> > > Wednesday, July 17 · 9:00 – 10:00am
>>>>>>>>>> > > >>>>>>>>>>> > > Time zone: America/Los_Angeles
>>>>>>>>>> > > >>>>>>>>>>> > > Google Meet joining info
>>>>>>>>>> > > >>>>>>>>>>> > > Video call link:
>>>>>>>>>> https://meet.google.com/pbm-ovzn-aoq
>>>>>>>>>> > > >>>>>>>>>>> > > Or dial: ‪(US) +1 650-449-9343‬ PIN: ‪170 576
>>>>>>>>>> 525‬#
>>>>>>>>>> > > >>>>>>>>>>> > > More phone numbers:
>>>>>>>>>> > > >>>>>>>>>>> https://tel.meet/pbm-ovzn-aoq?pin=4079632691790
>>>>>>>>>> > > >>>>>>>>>>> > >
>>>>>>>>>> > > >>>>>>>>>>> > > On Tue, May 28, 2024 at 9:21 PM Aihua Xu <
>>>>>>>>>> > > >>>>>>>>>>> aihua...@snowflake.com> wrote:
>>>>>>>>>> > > >>>>>>>>>>> > >
>>>>>>>>>> > > >>>>>>>>>>> > >> Hello,
>>>>>>>>>> > > >>>>>>>>>>> > >>
>>>>>>>>>> > > >>>>>>>>>>> > >> We have drafted the proposal
>>>>>>>>>> > > >>>>>>>>>>> > >> <
>>>>>>>>>> > > >>>>>>>>>>>
>>>>>>>>>> https://docs.google.com/document/d/1QjhpG_SVNPZh3anFcpicMQx90ebwjL7rmzFYfUP89Iw/edit
>>>>>>>>>> > > >>>>>>>>>>> >
>>>>>>>>>> > > >>>>>>>>>>> > >> for Variant data type. Please help review
>>>>>>>>>> and comment.
>>>>>>>>>> > > >>>>>>>>>>> > >>
>>>>>>>>>> > > >>>>>>>>>>> > >> Thanks,
>>>>>>>>>> > > >>>>>>>>>>> > >> Aihua
>>>>>>>>>> > > >>>>>>>>>>> > >>
>>>>>>>>>> > > >>>>>>>>>>> > >> On Thu, May 16, 2024 at 12:45 PM Jack Ye <
>>>>>>>>>> > > >>>>>>>>>>> yezhao...@gmail.com> wrote:
>>>>>>>>>> > > >>>>>>>>>>> > >>
>>>>>>>>>> > > >>>>>>>>>>> > >>> +10000 for a JSON/BSON type. We also had
>>>>>>>>>> the same
>>>>>>>>>> > > >>>>>>>>>>> discussion internally
>>>>>>>>>> > > >>>>>>>>>>> > >>> and a JSON type would really play well with
>>>>>>>>>> for example
>>>>>>>>>> > > >>>>>>>>>>> the SUPER type in
>>>>>>>>>> > > >>>>>>>>>>> > >>> Redshift:
>>>>>>>>>> > > >>>>>>>>>>> > >>>
>>>>>>>>>> > > >>>>>>>>>>>
>>>>>>>>>> https://docs.aws.amazon.com/redshift/latest/dg/r_SUPER_type.html,
>>>>>>>>>> > > >>>>>>>>>>> and
>>>>>>>>>> > > >>>>>>>>>>> > >>> can also provide better integration with
>>>>>>>>>> the Trino JSON
>>>>>>>>>> > > >>>>>>>>>>> type.
>>>>>>>>>> > > >>>>>>>>>>> > >>>
>>>>>>>>>> > > >>>>>>>>>>> > >>> Looking forward to the proposal!
>>>>>>>>>> > > >>>>>>>>>>> > >>>
>>>>>>>>>> > > >>>>>>>>>>> > >>> Best,
>>>>>>>>>> > > >>>>>>>>>>> > >>> Jack Ye
>>>>>>>>>> > > >>>>>>>>>>> > >>>
>>>>>>>>>> > > >>>>>>>>>>> > >>>
>>>>>>>>>> > > >>>>>>>>>>> > >>> On Wed, May 15, 2024 at 9:37 AM Tyler Akidau
>>>>>>>>>> > > >>>>>>>>>>> > >>> <tyler.aki...@snowflake.com.invalid> wrote:
>>>>>>>>>> > > >>>>>>>>>>> > >>>
>>>>>>>>>> > > >>>>>>>>>>> > >>>> On Tue, May 14, 2024 at 7:58 PM Gang Wu <
>>>>>>>>>> ust...@gmail.com>
>>>>>>>>>> > > >>>>>>>>>>> wrote:
>>>>>>>>>> > > >>>>>>>>>>> > >>>>
>>>>>>>>>> > > >>>>>>>>>>> > >>>>> > We may need some guidance on just how
>>>>>>>>>> many we need to
>>>>>>>>>> > > >>>>>>>>>>> look at;
>>>>>>>>>> > > >>>>>>>>>>> > >>>>> > we were planning on Spark and Trino,
>>>>>>>>>> but weren't sure
>>>>>>>>>> > > >>>>>>>>>>> how much
>>>>>>>>>> > > >>>>>>>>>>> > >>>>> > further down the rabbit hole we needed
>>>>>>>>>> to go.
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>
>>>>>>>>>> > > >>>>>>>>>>> > >>>>> There are some engines living outside the
>>>>>>>>>> Java world. It
>>>>>>>>>> > > >>>>>>>>>>> would be
>>>>>>>>>> > > >>>>>>>>>>> > >>>>> good if the proposal could cover the
>>>>>>>>>> effort it takes to
>>>>>>>>>> > > >>>>>>>>>>> integrate
>>>>>>>>>> > > >>>>>>>>>>> > >>>>> variant type to them (e.g. velox,
>>>>>>>>>> datafusion, etc.).
>>>>>>>>>> > > >>>>>>>>>>> This is something
>>>>>>>>>> > > >>>>>>>>>>> > >>>>> that
>>>>>>>>>> > > >>>>>>>>>>> > >>>>> some proprietary iceberg vendors also
>>>>>>>>>> care about.
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>
>>>>>>>>>> > > >>>>>>>>>>> > >>>>
>>>>>>>>>> > > >>>>>>>>>>> > >>>> Ack, makes sense. We can make sure to
>>>>>>>>>> share some
>>>>>>>>>> > > >>>>>>>>>>> perspective on this.
>>>>>>>>>> > > >>>>>>>>>>> > >>>>
>>>>>>>>>> > > >>>>>>>>>>> > >>>> > Not necessarily, no. As long as there's
>>>>>>>>>> a binary type
>>>>>>>>>> > > >>>>>>>>>>> and Iceberg and
>>>>>>>>>> > > >>>>>>>>>>> > >>>>> > the query engines are aware that the
>>>>>>>>>> binary column
>>>>>>>>>> > > >>>>>>>>>>> needs to be
>>>>>>>>>> > > >>>>>>>>>>> > >>>>> > interpreted as a variant, that should
>>>>>>>>>> be sufficient.
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>
>>>>>>>>>> > > >>>>>>>>>>> > >>>>> From the perspective of interoperability, it
>>>>>>>>>> > > >>>>>>>>>>> > >>>>> would be good to support a native type in the
>>>>>>>>>> > > >>>>>>>>>>> > >>>>> file specs. Life will be easier for projects
>>>>>>>>>> > > >>>>>>>>>>> > >>>>> like Apache XTable. File formats could also
>>>>>>>>>> > > >>>>>>>>>>> > >>>>> provide finer-grained statistics for the
>>>>>>>>>> > > >>>>>>>>>>> > >>>>> variant type, which facilitates data skipping.
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>
>>>>>>>>>> > > >>>>>>>>>>> > >>>>
>>>>>>>>>> > > >>>>>>>>>>> > >>>> Agreed, there can definitely be additional
>>>>>>>>>> value in
>>>>>>>>>> > > >>>>>>>>>>> native file format
>>>>>>>>>> > > >>>>>>>>>>> > >>>> integration. Just wanted to highlight that
>>>>>>>>>> it's not a
>>>>>>>>>> > > >>>>>>>>>>> strict requirement.
>>>>>>>>>> > > >>>>>>>>>>> > >>>>
>>>>>>>>>> > > >>>>>>>>>>> > >>>> -Tyler
>>>>>>>>>> > > >>>>>>>>>>> > >>>>
>>>>>>>>>> > > >>>>>>>>>>> > >>>>
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>
>>>>>>>>>> > > >>>>>>>>>>> > >>>>> Gang
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>
>>>>>>>>>> > > >>>>>>>>>>> > >>>>> On Wed, May 15, 2024 at 6:49 AM Tyler
>>>>>>>>>> Akidau
>>>>>>>>>> > > >>>>>>>>>>> > >>>>> <tyler.aki...@snowflake.com.invalid>
>>>>>>>>>> wrote:
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>> Good to see you again as well, JB!
>>>>>>>>>> Thanks!
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>> -Tyler
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>> On Tue, May 14, 2024 at 1:04 PM
>>>>>>>>>> Jean-Baptiste Onofré <
>>>>>>>>>> > > >>>>>>>>>>> j...@nanthrax.net>
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>> wrote:
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> Hi Tyler,
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>>
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> Super happy to see you there :) It reminds
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> me of our discussions back at the start of
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> Apache Beam :)
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>>
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> Anyway, the thread is pretty
>>>>>>>>>> interesting. I remember
>>>>>>>>>> > > >>>>>>>>>>> some discussions
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> about JSON datatype for spec v3. The
>>>>>>>>>> binary data type
>>>>>>>>>> > > >>>>>>>>>>> is already
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> supported in the spec v2.
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>>
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> I'm looking forward to the proposal and
>>>>>>>>>> happy to help
>>>>>>>>>> > > >>>>>>>>>>> on this !
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>>
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> Regards
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> JB
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>>
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> On Sat, May 11, 2024 at 7:06 AM Tyler
>>>>>>>>>> Akidau
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> <tyler.aki...@snowflake.com.invalid>
>>>>>>>>>> wrote:
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> >
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> > Hello,
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> >
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> > We (Tyler, Nileema, Selcuk, Aihua)
>>>>>>>>>> are working on a
>>>>>>>>>> > > >>>>>>>>>>> proposal for
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> which we’d like to get early feedback
>>>>>>>>>> from the
>>>>>>>>>> > > >>>>>>>>>>> community. As you may know,
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> Snowflake has embraced Iceberg as its
>>>>>>>>>> open Data Lake
>>>>>>>>>> > > >>>>>>>>>>> format. Having made
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> good progress on our own adoption of
>>>>>>>>>> the Iceberg
>>>>>>>>>> > > >>>>>>>>>>> standard, we’re now in a
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> position where there are features not
>>>>>>>>>> yet supported in
>>>>>>>>>> > > >>>>>>>>>>> Iceberg which we
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> think would be valuable for our users,
>>>>>>>>>> and that we
>>>>>>>>>> > > >>>>>>>>>>> would like to discuss
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> with and help contribute to the Iceberg
>>>>>>>>>> community.
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> >
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> > The first two such features we’d like
>>>>>>>>>> to discuss are
>>>>>>>>>> > > >>>>>>>>>>> in support of
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> efficient querying of dynamically typed,
>>>>>>>>>> > > >>>>>>>>>>> semi-structured data: variant data
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> types, and subcolumnarization of
>>>>>>>>>> variant columns. In
>>>>>>>>>> > > >>>>>>>>>>> more detail, for
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> anyone who may not already be familiar:
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> >
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> > 1. Variant data types
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> > Variant types allow for the efficient
>>>>>>>>>> binary
>>>>>>>>>> > > >>>>>>>>>>> encoding of dynamic
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> semi-structured data such as JSON,
>>>>>>>>>> Avro, etc. By
>>>>>>>>>> > > >>>>>>>>>>> encoding semi-structured
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> data as a variant column, we retain the
>>>>>>>>>> flexibility of
>>>>>>>>>> > > >>>>>>>>>>> the source data,
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> while allowing query engines to more
>>>>>>>>>> efficiently
>>>>>>>>>> > > >>>>>>>>>>> operate on the data.
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> Snowflake has supported the variant
>>>>>>>>>> data type on
>>>>>>>>>> > > >>>>>>>>>>> Snowflake tables for many
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> years [1]. As more and more users
>>>>>>>>>> utilize Iceberg
>>>>>>>>>> > > >>>>>>>>>>> tables in Snowflake,
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> we’re hearing an increasing chorus of
>>>>>>>>>> requests for
>>>>>>>>>> > > >>>>>>>>>>> variant support.
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> Additionally, other query engines such
>>>>>>>>>> as Apache Spark
>>>>>>>>>> > > >>>>>>>>>>> have begun adding
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> variant support [2]. As such, we
>>>>>>>>>> believe it would be
>>>>>>>>>> > > >>>>>>>>>>> beneficial to the
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> Iceberg community as a whole to
>>>>>>>>>> standardize on the
>>>>>>>>>> > > >>>>>>>>>>> variant data type
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> encoding used across Iceberg tables.
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> >
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> > One specific point to make here is
>>>>>>>>>> that, since an
>>>>>>>>>> > > >>>>>>>>>>> Apache OSS
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> version of variant encoding already
>>>>>>>>>> exists in Spark,
>>>>>>>>>> > > >>>>>>>>>>> it likely makes sense
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> to simply adopt the Spark encoding as
>>>>>>>>>> the Iceberg
>>>>>>>>>> > > >>>>>>>>>>> standard as well. The
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> encoding we use internally today in
>>>>>>>>>> Snowflake is
>>>>>>>>>> > > >>>>>>>>>>> slightly different, but
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> essentially equivalent, and we see no
>>>>>>>>>> particular value
>>>>>>>>>> > > >>>>>>>>>>> in trying to clutter
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> the space with another
>>>>>>>>>> equivalent-but-incompatible
>>>>>>>>>> > > >>>>>>>>>>> encoding.
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> >
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> >
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> > 2. Subcolumnarization
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> > Subcolumnarization of variant columns
>>>>>>>>>> allows query
>>>>>>>>>> > > >>>>>>>>>>> engines to
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> efficiently prune datasets when
>>>>>>>>>> subcolumns (i.e.,
>>>>>>>>>> > > >>>>>>>>>>> nested fields) within a
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> variant column are queried, and also
>>>>>>>>>> allows optionally
>>>>>>>>>> > > >>>>>>>>>>> materializing some
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> of the nested fields as a column on
>>>>>>>>>> their own,
>>>>>>>>>> > > >>>>>>>>>>> affording queries on these
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> subcolumns the ability to read less
>>>>>>>>>> data and spend
>>>>>>>>>> > > >>>>>>>>>>> less CPU on extraction.
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> When subcolumnarizing, the system
>>>>>>>>>> managing table
>>>>>>>>>> > > >>>>>>>>>>> metadata and data tracks
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> individual pruning statistics (min,
>>>>>>>>>> max, null, etc.)
>>>>>>>>>> > > >>>>>>>>>>> for some subset of the
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> nested fields within a variant, and
>>>>>>>>>> also manages any
>>>>>>>>>> > > >>>>>>>>>>> optional
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> materialization. Without
>>>>>>>>>> subcolumnarization, any query
>>>>>>>>>> > > >>>>>>>>>>> which touches a
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> variant column must read, parse,
>>>>>>>>>> extract, and filter
>>>>>>>>>> > > >>>>>>>>>>> every row for which
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> that column is non-null. Thus, by
>>>>>>>>>> providing a
>>>>>>>>>> > > >>>>>>>>>>> standardized way of tracking
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> subcolumn metadata and data for variant
>>>>>>>>>> columns,
>>>>>>>>>> > > >>>>>>>>>>> Iceberg can make
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> subcolumnar optimizations accessible
>>>>>>>>>> across various
>>>>>>>>>> > > >>>>>>>>>>> catalogs and query
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> engines.
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> >
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> > Subcolumnarization is a non-trivial
>>>>>>>>>> topic, so we
>>>>>>>>>> > > >>>>>>>>>>> expect any
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> concrete proposal to include not only
>>>>>>>>>> the set of
>>>>>>>>>> > > >>>>>>>>>>> changes to Iceberg
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> metadata that allow compatible query
>>>>>>>>>> engines to
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> interoperate on
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> subcolumnarization data for variant
>>>>>>>>>> columns, but also
>>>>>>>>>> > > >>>>>>>>>>> reference
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> documentation explaining
>>>>>>>>>> subcolumnarization principles
>>>>>>>>>> > > >>>>>>>>>>> and recommended best
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> practices.
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> >
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> >
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> > It sounds like the recent Geo
>>>>>>>>>> proposal [3] may be a
>>>>>>>>>> > > >>>>>>>>>>> good starting
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> point for how to approach this, so our
>>>>>>>>>> plan is to
>>>>>>>>>> > > >>>>>>>>>>> write something up in
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> that vein that covers the proposed spec
>>>>>>>>>> changes,
>>>>>>>>>> > > >>>>>>>>>>> backwards compatibility,
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> implementor burdens, etc. But we wanted
>>>>>>>>>> to first reach
>>>>>>>>>> > > >>>>>>>>>>> out to the community
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> to introduce ourselves and the idea,
>>>>>>>>>> and see if
>>>>>>>>>> > > >>>>>>>>>>> there’s any early feedback
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> we should incorporate before we spend
>>>>>>>>>> too much time on
>>>>>>>>>> > > >>>>>>>>>>> a concrete proposal.
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> >
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> > Thank you!
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> >
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> > [1]
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>>
>>>>>>>>>> > > >>>>>>>>>>>
>>>>>>>>>> https://docs.snowflake.com/en/sql-reference/data-types-semistructured
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> > [2]
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>>
>>>>>>>>>> > > >>>>>>>>>>>
>>>>>>>>>> https://github.com/apache/spark/blob/master/common/variant/README.md
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> > [3]
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>>
>>>>>>>>>> > > >>>>>>>>>>>
>>>>>>>>>> https://docs.google.com/document/d/1iVFbrRNEzZl8tDcZC81GFt01QJkLJsI9E2NBOt21IRI/edit
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> >
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> > -Tyler, Nileema, Selcuk, Aihua
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> >
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>>
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>
>>>>>>>>>> > > >>>>>>>>>>> >
>>>>>>>>>> > > >>>>>>>>>>> > --
>>>>>>>>>> > > >>>>>>>>>>> > Ryan Blue
>>>>>>>>>> > > >>>>>>>>>>> > Databricks
>>>>>>>>>> > > >>>>>>>>>>> >
>>>>>>>>>> > > >>>>>>>>>>>
>>>>>>>>>> > > >>>>>>>>>>
>>>>>>>>>> > > >>>>>>>>>
>>>>>>>>>> > > >>>>>>>>> --
>>>>>>>>>> > > >>>>>>>>> Ryan Blue
>>>>>>>>>> > > >>>>>>>>> Databricks
>>>>>>>>>> > > >>>>>>>>>
>>>>>>>>>> > > >>>>>>>>
>>>>>>>>>> > > >>>>>>>>
>>>>>>>>>> > > >>>>>>>> --
>>>>>>>>>> > > >>>>>>>> Ryan Blue
>>>>>>>>>> > > >>>>>>>> Databricks
>>>>>>>>>> > > >>>>>>>>
>>>>>>>>>> > > >>>>>>>
>>>>>>>>>> > > >>>>>>
>>>>>>>>>> > > >>>>>> --
>>>>>>>>>> > > >>>>>> Ryan Blue
>>>>>>>>>> > > >>>>>> Databricks
>>>>>>>>>> > > >>>>>>
>>>>>>>>>> > > >>>>>
>>>>>>>>>> > > >>>>
>>>>>>>>>> > > >>>> --
>>>>>>>>>> > > >>>> Ryan Blue
>>>>>>>>>> > > >>>> Databricks
>>>>>>>>>> > > >>>>
>>>>>>>>>> > > >>>
>>>>>>>>>> > > >>
>>>>>>>>>> > > >> --
>>>>>>>>>> > > >> Ryan Blue
>>>>>>>>>> > > >> Databricks
>>>>>>>>>> > > >>
>>>>>>>>>> > > >
>>>>>>>>>> > >
>>>>>>>>>> >
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Ryan Blue
>>>>>>>> Databricks
>>>>>>>>
>>>>>>>
