Any chance this meeting was recorded? I couldn't make it but would be
interested in catching up on the discussion.

Thanks,

Amogh Jahagirdar

On Tue, Jul 23, 2024 at 11:30 AM Aihua Xu <aihu...@gmail.com> wrote:

> Thanks folks for additional discussion.
>
> There are some questions related to subcolumnization (Spark shredding -
> see the discussion <https://github.com/apache/spark/pull/46831>), and we
> would like to host another meeting mainly to discuss that, since we plan to
> adopt it. We can also follow up on the Spark variant topics (I can see that
> we are mostly aligned, with the exception of finding a place for the spec
> and implementation). Looking forward to meeting with you. BTW: should I
> include dev@iceberg.apache.org in the email invite?
>
> Sync up on Variant subcolumnization (shredding)
> Thursday, July 25 · 8:00 – 9:00am
> Time zone: America/Los_Angeles
> Google Meet joining info
> Video call link: https://meet.google.com/mug-dvnv-hnq
> Or dial: (US) +1 904-900-0730 PIN: 671 997 419#
> More phone numbers: https://tel.meet/mug-dvnv-hnq?pin=1654043233422
>
> Thanks,
> Aihua
>
> On Tue, Jul 23, 2024 at 6:36 AM Amogh Jahagirdar <2am...@gmail.com> wrote:
>
>> I'm late replying to this, but I'm also in agreement with 1 (adopting the
>> Spark variant encoding), 3 (specifically only having a variant type), and
>> 4 (ensuring we think through subcolumnarization upfront, since without it
>> the variant type may not be that useful).
>>
>> I'd also support having the spec and reference implementation in Iceberg;
>> as others have said, it centralizes improvements in a single, agnostic
>> dependency for engines, rather than engines having to take dependencies on
>> other engine modules.
>>
>> Thanks,
>>
>> Amogh Jahagirdar
>>
>> On Tue, Jul 23, 2024 at 12:15 AM Péter Váry <peter.vary.apa...@gmail.com>
>> wrote:
>>
>>> I have been looking around for how we could map the Variant type in
>>> Flink. I have not found any existing type which we could use, but Flink
>>> already has some JSON parsing capabilities [1] for string fields.
>>>
>>> So until Flink has native support for something similar to the Variant
>>> type, I expect that we need to map it to JSON strings in RowData.
>>>
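>>> To make that concrete, here is a minimal sketch of such a mapping (my
>>> illustration only; toJson is a hypothetical placeholder for whatever
>>> conversion an Iceberg variant module would expose):
>>>
>>>   import org.apache.flink.table.data.GenericRowData;
>>>   import org.apache.flink.table.data.RowData;
>>>   import org.apache.flink.table.data.StringData;
>>>
>>>   public class VariantAsJson {
>>>     // Hypothetical placeholder: variant binary -> JSON text, to be
>>>     // provided by an Iceberg variant serializer/deserializer module.
>>>     static String toJson(byte[] metadata, byte[] value) {
>>>       throw new UnsupportedOperationException("not specified yet");
>>>     }
>>>
>>>     // Surface a variant column as a JSON string field in RowData.
>>>     static RowData asRow(byte[] metadata, byte[] value) {
>>>       GenericRowData row = new GenericRowData(1);
>>>       row.setField(0, StringData.fromString(toJson(metadata, value)));
>>>       return row;
>>>     }
>>>   }
>>>
>>> Flink SQL users could then query such a field with the JSON functions in
>>> [1], e.g. JSON_VALUE(variant_col, '$.some.field').
>>>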
>>> Based on that, here are my preferences:
>>> 1. I'm OK with adopting the Spark Variant type, if we build our own
>>> Iceberg serializer/deserializer module for it.
>>> 2. I prefer to move the spec to Iceberg, so we own it and can extend it
>>> if needed. This could be important in the first phase. Later, when it is
>>> more stable, we might donate it to some other project, like Parquet.
>>> 3. I would prefer to support only a single type, and Variant is more
>>> expressive, but having a standard way to convert between JSON and Variant
>>> would be useful for Flink users.
>>> 4. On subcolumnarization: I think Flink will only use this feature to the
>>> extent that the Iceberg readers implement it, so I would like to see as
>>> much of it as possible in the common Iceberg code.
>>>
>>> Thanks,
>>> Peter
>>>
>>> [1] -
>>> https://nightlies.apache.org/flink/flink-docs-release-1.19/docs/dev/table/functions/systemfunctions/#json-functions
>>>
>>>
>>> On Tue, Jul 23, 2024, 06:36 Micah Kornfield <emkornfi...@gmail.com>
>>> wrote:
>>>
>>>> Sorry for the late reply.  I agree with the sentiments on 1 and 3 that
>>>> have already been posted (adopt the Spark encoding, and only have the
>>>> Variant type).  As mentioned in the doc for 3, I think it would be good to
>>>> specify how to map scalar types to a JSON representation so there can be
>>>> consistency between engines that don't support variant.
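>>>>
>>>> Purely as an illustration of the kind of mapping I mean (every choice
>>>> below -- ISO-8601 for dates, base64 for binary, naive string escaping --
>>>> is a strawman the spec would need to pin down, not decided behavior):
>>>>
>>>>   import java.time.LocalDate;
>>>>   import java.util.Base64;
>>>>
>>>>   class ScalarJsonFallback {
>>>>     static String scalarToJson(Object v) {
>>>>       if (v == null) return "null";
>>>>       if (v instanceof Boolean || v instanceof Number) {
>>>>         return String.valueOf(v);
>>>>       }
>>>>       if (v instanceof byte[]) {
>>>>         return "\"" + Base64.getEncoder().encodeToString((byte[]) v) + "\"";
>>>>       }
>>>>       if (v instanceof LocalDate) return "\"" + v + "\""; // ISO-8601 text
>>>>       // Strings and anything else: quote with naive escaping.
>>>>       return "\"" + v.toString().replace("\"", "\\\"") + "\"";
>>>>     }
>>>>   }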
>>>>
>>>>
>>>>> Regarding point 2, I also feel Iceberg is the more natural host for
>>>>> such a subproject for the variant spec and implementation. But let me
>>>>> reach out to the Spark community to discuss.
>>>>
>>>>
>>>> The only other place I can think of that might be a good home for the
>>>> Variant spec is Apache Arrow, as a canonical extension type. There is an
>>>> issue for this [1].  I think the main consideration for where this is
>>>> housed is which types are intended to be supported.  I believe Arrow is
>>>> currently a superset of the Iceberg type system (UUID is supported as a
>>>> canonical extension type [2]).
>>>>
>>>> For point 4, subcolumnarization, I think this ideally belongs in Iceberg
>>>> (and if Iceberg and Delta Lake can agree on how to do it, that would be
>>>> great), in consultation with the Parquet/ORC communities to potentially
>>>> add better native support.
>>>>
>>>> Thanks,
>>>> Micah
>>>>
>>>>
>>>>
>>>> [1] https://github.com/apache/arrow/issues/42069
>>>> [2] https://arrow.apache.org/docs/format/CanonicalExtensions.html
>>>>
>>>> On Sat, Jul 20, 2024 at 5:54 PM Aihua Xu <aihu...@gmail.com> wrote:
>>>>
>>>>> Thanks for the discussion and feedback.
>>>>>
>>>>> Do we have consensus on points 1 and 3, to move forward with the
>>>>> Spark variant encoding and support the Variant type only? If not, let
>>>>> me know how to proceed from here.
>>>>>
>>>>> Regarding point 2, I also feel Iceberg is the more natural host for
>>>>> such a subproject for the variant spec and implementation. But let me
>>>>> reach out to the Spark community to discuss.
>>>>>
>>>>> Thanks,
>>>>> Aihua
>>>>>
>>>>>
>>>>> On Fri, Jul 19, 2024 at 9:35 AM Yufei Gu <flyrain...@gmail.com> wrote:
>>>>>
>>>>>> Agreed with point 1.
>>>>>>
>>>>>> For point 2, I also prefer to host the spec and reference
>>>>>> implementation under Iceberg. Here are the reasons:
>>>>>> 1. It is unconventional and impractical for one engine to depend on
>>>>>> another for data types. For instance, it is not ideal for Trino to rely
>>>>>> on data types defined by the Spark engine.
>>>>>> 2. Iceberg serves as a bridge between engines and file formats. By
>>>>>> centralizing the specification in Iceberg, any future optimizations or
>>>>>> updates to file formats can be referenced within Iceberg, ensuring
>>>>>> consistency and reducing dependencies.
>>>>>>
>>>>>> For point 3, I'd prefer to support only the variant type at the
>>>>>> moment.
>>>>>>
>>>>>> Yufei
>>>>>>
>>>>>>
>>>>>> On Thu, Jul 18, 2024 at 12:55 PM Ryan Blue
>>>>>> <b...@databricks.com.invalid> wrote:
>>>>>>
>>>>>>> Similarly, I'm aligned with point 1 and I'd choose to support only
>>>>>>> variant for point 3.
>>>>>>>
>>>>>>> We'll need to work with the Spark community to find a good place for
>>>>>>> the library and spec, since it touches many different projects. I'd also
>>>>>>> prefer Iceberg as the home.
>>>>>>>
>>>>>>> I also think it's a good idea to get subcolumnarization into our
>>>>>>> spec when we update. Without that I think the feature will be fairly
>>>>>>> limited.
>>>>>>>
>>>>>>> On Thu, Jul 18, 2024 at 10:56 AM Russell Spitzer <
>>>>>>> russell.spit...@gmail.com> wrote:
>>>>>>>
>>>>>>>> I'm aligned with point 1.
>>>>>>>>
>>>>>>>> For point 2, I think we should choose quickly. I honestly think this
>>>>>>>> would be fine as part of the Iceberg spec directly, but I understand
>>>>>>>> it may be better for the broader community if it were a subproject.
>>>>>>>> As a subproject, I would still prefer it to be an Iceberg subproject,
>>>>>>>> since we are engine/file-format agnostic.
>>>>>>>>
>>>>>>>> 3. I support adding just Variant.
>>>>>>>>
>>>>>>>> On Thu, Jul 18, 2024 at 12:54 AM Aihua Xu <aihu...@apache.org>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hello community,
>>>>>>>>>
>>>>>>>>> It was great to sync up with some of you on Variant and
>>>>>>>>> subcolumnization support in Iceberg again. Apologies that I didn't
>>>>>>>>> record the meeting, but here are some key items that we want to
>>>>>>>>> follow up on with the community.
>>>>>>>>>
>>>>>>>>> 1. Adopt Spark Variant encoding
>>>>>>>>> Those present were in favor of adopting the Spark variant encoding
>>>>>>>>> for Iceberg Variant, with extensions to support other Iceberg types.
>>>>>>>>> We would like to know if anyone has an objection to reusing this open
>>>>>>>>> source encoding.
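>>>>>>>>>
>>>>>>>>> As background for anyone who hasn't read the Spark doc: the encoding
>>>>>>>>> represents each value as a pair of binaries, conceptually something
>>>>>>>>> like the sketch below (shape only; see the Spark variant README for
>>>>>>>>> the real layout):
>>>>>>>>>
>>>>>>>>>   // 'metadata' holds a dictionary of object keys; 'value' is a
>>>>>>>>>   // self-describing buffer that references that dictionary, which
>>>>>>>>>   // is what makes nested-field access cheap compared to re-parsing
>>>>>>>>>   // JSON text.
>>>>>>>>>   record VariantBinary(byte[] metadata, byte[] value) {}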
>>>>>>>>>
>>>>>>>>> 2. Movement of the Spark Variant spec to another project
>>>>>>>>> To avoid introducing Apache Spark as a dependency for the engines
>>>>>>>>> and file formats, we discussed separating the Spark Variant encoding
>>>>>>>>> spec and implementation from the Spark project into a neutral
>>>>>>>>> location. We came up with several options but didn't reach consensus
>>>>>>>>> on any of them. We are looking for more feedback on this topic from
>>>>>>>>> the community, either in terms of support for one of these options or
>>>>>>>>> another idea for how to host the spec.
>>>>>>>>>
>>>>>>>>> Options proposed:
>>>>>>>>> * Leave the spec in Spark (difficult for versioning and other
>>>>>>>>> engines)
>>>>>>>>> * Copy the spec into the Iceberg project directly (difficult for
>>>>>>>>> other table formats)
>>>>>>>>> * Create a sub-project of Apache Iceberg and move the spec and
>>>>>>>>> reference implementation there (logistically complicated)
>>>>>>>>> * Create a sub-project of Apache Spark and move the spec and
>>>>>>>>> reference implementation there (logistically complicated)
>>>>>>>>>
>>>>>>>>> 3. Add Variant type vs. Variant and JSON types
>>>>>>>>> Those who were present were in favor of adding only the Variant type
>>>>>>>>> to Iceberg. We are looking for anyone who has an objection to going
>>>>>>>>> forward with just the Variant type and no Iceberg JSON type. We
>>>>>>>>> favored adding the Variant type only because:
>>>>>>>>> * Introducing a JSON type would require engines that only support
>>>>>>>>> VARIANT to do write-time validation of their input to a JSON column;
>>>>>>>>> an engine without a JSON type couldn't support this.
>>>>>>>>> * Engines which don't support Variant will work most of the time, and
>>>>>>>>> the spec can define fallback strings for reading unsupported types.
>>>>>>>>> Writing JSON into a Variant will always work.
>>>>>>>>>
>>>>>>>>> 4. Support for Subcolumnization spec (shredding in Spark)
>>>>>>>>> We have no action items on this but would like to follow up on
>>>>>>>>> subcolumnization discussions in the future; a rough sketch of the
>>>>>>>>> idea follows the bullets below.
>>>>>>>>> * We had general agreement that this should be included in Iceberg
>>>>>>>>> V3, or else adding variant may not be useful.
>>>>>>>>> * We are also interested in adopting the shredding spec from Spark
>>>>>>>>> and would like to move it to whatever place we decide the Variant
>>>>>>>>> spec is going to live.
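>>>>>>>>>
>>>>>>>>> As promised above, a rough sketch of what subcolumnization implies
>>>>>>>>> (illustrative only; the field names and types here are made up, not
>>>>>>>>> from any spec):
>>>>>>>>>
>>>>>>>>>   // Per-subcolumn pruning metadata a writer could track for a
>>>>>>>>>   // variant column, plus an optional materialized (shredded) column.
>>>>>>>>>   class SubcolumnStats {
>>>>>>>>>     String path;            // e.g. "$.user.id" inside the variant
>>>>>>>>>     Object lowerBound;      // min value observed for this path
>>>>>>>>>     Object upperBound;      // max value observed for this path
>>>>>>>>>     long nullCount;         // rows where the path is missing/null
>>>>>>>>>     Integer materializedId; // column id if also stored shredded
>>>>>>>>>   }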
>>>>>>>>>
>>>>>>>>> Let us know if we missed anything and if you have any additional
>>>>>>>>> thoughts or suggestions.
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> Aihua
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 2024/07/15 18:32:22 Aihua Xu wrote:
>>>>>>>>> > Thanks for the discussion.
>>>>>>>>> >
>>>>>>>>> > I will move forward to work on spec PR.
>>>>>>>>> >
>>>>>>>>> > Regarding the implementation, we will have a module for Variant
>>>>>>>>> support in Iceberg, so we will not have to bring in Spark libraries.
>>>>>>>>> >
>>>>>>>>> > I'm reposting the meeting invite in case it was not clear in my
>>>>>>>>> original email, since I included it at the end. It looks like we don't
>>>>>>>>> have major objections/divergences, but let's sync up and reach
>>>>>>>>> consensus.
>>>>>>>>> >
>>>>>>>>> > Meeting invite:
>>>>>>>>> >
>>>>>>>>> > Wednesday, July 17 · 9:00 – 10:00am
>>>>>>>>> > Time zone: America/Los_Angeles
>>>>>>>>> > Google Meet joining info
>>>>>>>>> > Video call link: https://meet.google.com/pbm-ovzn-aoq
>>>>>>>>> > Or dial: (US) +1 650-449-9343 PIN: 170 576 525#
>>>>>>>>> > More phone numbers:
>>>>>>>>> https://tel.meet/pbm-ovzn-aoq?pin=4079632691790
>>>>>>>>> >
>>>>>>>>> > Thanks,
>>>>>>>>> > Aihua
>>>>>>>>> >
>>>>>>>>> > On 2024/07/12 20:55:01 Micah Kornfield wrote:
>>>>>>>>> > > I don't think this needs to hold up the PR, but I think coming to
>>>>>>>>> > > a consensus on the exact set of supported types is worthwhile (as
>>>>>>>>> > > is whether the goal is to maintain the same set as specified by
>>>>>>>>> > > the Spark Variant type, or whether divergence is
>>>>>>>>> > > expected/allowed).  From a fragmentation perspective it would be a
>>>>>>>>> > > shame if they diverge, so maybe a next step is also suggesting
>>>>>>>>> > > support to the Spark community for the missing existing Iceberg
>>>>>>>>> > > types?
>>>>>>>>> > >
>>>>>>>>> > > Thanks,
>>>>>>>>> > > Micah
>>>>>>>>> > >
>>>>>>>>> > > On Fri, Jul 12, 2024 at 1:44 PM Russell Spitzer <
>>>>>>>>> russell.spit...@gmail.com>
>>>>>>>>> > > wrote:
>>>>>>>>> > >
>>>>>>>>> > > > Just talked with Aihua and he's working on the Spec PR now.
>>>>>>>>> We can get
>>>>>>>>> > > > feedback there from everyone.
>>>>>>>>> > > >
>>>>>>>>> > > > On Fri, Jul 12, 2024 at 3:41 PM Ryan Blue
>>>>>>>>> <b...@databricks.com.invalid>
>>>>>>>>> > > > wrote:
>>>>>>>>> > > >
>>>>>>>>> > > >> Good idea, but I'm hoping that we can continue to get their
>>>>>>>>> feedback in
>>>>>>>>> > > >> parallel to getting the spec changes started. Piotr didn't
>>>>>>>>> seem to object
>>>>>>>>> > > >> to the encoding from what I read of his comments. Hopefully
>>>>>>>>> he (and others)
>>>>>>>>> > > >> chime in here.
>>>>>>>>> > > >>
>>>>>>>>> > > >> On Fri, Jul 12, 2024 at 1:32 PM Russell Spitzer <
>>>>>>>>> > > >> russell.spit...@gmail.com> wrote:
>>>>>>>>> > > >>
>>>>>>>>> > > >>> I just want to make sure we get Piotr and Peter on board as
>>>>>>>>> > > >>> representatives of Flink and Trino engines. Also make sure
>>>>>>>>> we have anyone
>>>>>>>>> > > >>> else chime in who has experience with Ray if possible.
>>>>>>>>> > > >>>
>>>>>>>>> > > >>> Spec changes feel like the right next step.
>>>>>>>>> > > >>>
>>>>>>>>> > > >>> On Fri, Jul 12, 2024 at 3:14 PM Ryan Blue
>>>>>>>>> <b...@databricks.com.invalid>
>>>>>>>>> > > >>> wrote:
>>>>>>>>> > > >>>
>>>>>>>>> > > >>>> Okay, what are the next steps here? This proposal has
>>>>>>>>> been out for
>>>>>>>>> > > >>>> quite a while and I don't see any major objections to
>>>>>>>>> using the Spark
>>>>>>>>> > > >>>> encoding. It's quite well designed and fits the need
>>>>>>>>> well. It can also be
>>>>>>>>> > > >>>> extended to support additional types that are missing if
>>>>>>>>> that's a priority.
>>>>>>>>> > > >>>>
>>>>>>>>> > > >>>> Should we move forward by starting a draft of the changes
>>>>>>>>> to the table
>>>>>>>>> > > >>>> spec? Then we can vote on committing those changes and
>>>>>>>>> get moving on an
>>>>>>>>> > > >>>> implementation (or possibly do the implementation in
>>>>>>>>> parallel).
>>>>>>>>> > > >>>>
>>>>>>>>> > > >>>> On Fri, Jul 12, 2024 at 1:08 PM Russell Spitzer <
>>>>>>>>> > > >>>> russell.spit...@gmail.com> wrote:
>>>>>>>>> > > >>>>
>>>>>>>>> > > >>>>> That's fair, I'm sold on an Iceberg Module.
>>>>>>>>> > > >>>>>
>>>>>>>>> > > >>>>> On Fri, Jul 12, 2024 at 2:53 PM Ryan Blue
>>>>>>>>> <b...@databricks.com.invalid>
>>>>>>>>> > > >>>>> wrote:
>>>>>>>>> > > >>>>>
>>>>>>>>> > > >>>>>> > Feels like eventually the encoding should land in
>>>>>>>>> parquet proper
>>>>>>>>> > > >>>>>> right?
>>>>>>>>> > > >>>>>>
>>>>>>>>> > > >>>>>> What about using it in ORC? I don't know where it
>>>>>>>>> should end up.
>>>>>>>>> > > >>>>>> Maybe Iceberg should make a standalone module from it?
>>>>>>>>> > > >>>>>>
>>>>>>>>> > > >>>>>> On Fri, Jul 12, 2024 at 12:38 PM Russell Spitzer <
>>>>>>>>> > > >>>>>> russell.spit...@gmail.com> wrote:
>>>>>>>>> > > >>>>>>
>>>>>>>>> > > >>>>>>> Feels like eventually the encoding should land in
>>>>>>>>> parquet proper
>>>>>>>>> > > >>>>>>> right? I'm fine with us just copying into Iceberg
>>>>>>>>> though for the time
>>>>>>>>> > > >>>>>>> being.
>>>>>>>>> > > >>>>>>>
>>>>>>>>> > > >>>>>>> On Fri, Jul 12, 2024 at 2:31 PM Ryan Blue
>>>>>>>>> > > >>>>>>> <b...@databricks.com.invalid> wrote:
>>>>>>>>> > > >>>>>>>
>>>>>>>>> > > >>>>>>>> Oops, it looks like I missed where Aihua brought this
>>>>>>>>> up in his
>>>>>>>>> > > >>>>>>>> last email:
>>>>>>>>> > > >>>>>>>>
>>>>>>>>> > > >>>>>>>> > would there be an issue with directly using the Spark
>>>>>>>>> implementation in
>>>>>>>>> > > >>>>>>>> Iceberg?
>>>>>>>>> > > >>>>>>>>
>>>>>>>>> > > >>>>>>>> Yes, I think that we do have an issue using the Spark
>>>>>>>>> library. What
>>>>>>>>> > > >>>>>>>> do you think about a Java implementation in Iceberg?
>>>>>>>>> > > >>>>>>>>
>>>>>>>>> > > >>>>>>>> Ryan
>>>>>>>>> > > >>>>>>>>
>>>>>>>>> > > >>>>>>>> On Fri, Jul 12, 2024 at 12:28 PM Ryan Blue <
>>>>>>>>> b...@databricks.com>
>>>>>>>>> > > >>>>>>>> wrote:
>>>>>>>>> > > >>>>>>>>
>>>>>>>>> > > >>>>>>>>> I raised the same point from Peter's email in a
>>>>>>>>> comment on the doc
>>>>>>>>> > > >>>>>>>>> as well. There is a spark-variant_2.13 artifact that
>>>>>>>>> would be a much
>>>>>>>>> > > >>>>>>>>> smaller scope than relying on large portions of
>>>>>>>>> Spark, but even then I
>>>>>>>>> > > >>>>>>>>> doubt that it is a good idea for Iceberg to depend
>>>>>>>>> on that because it is a
>>>>>>>>> > > >>>>>>>>> Scala artifact and we would need to bring in a ton
>>>>>>>>> of Scala libs. I think
>>>>>>>>> > > >>>>>>>>> what makes the most sense is to have an independent
>>>>>>>>> implementation of the
>>>>>>>>> > > >>>>>>>>> spec in Iceberg.
>>>>>>>>> > > >>>>>>>>>
>>>>>>>>> > > >>>>>>>>> On Fri, Jul 12, 2024 at 11:51 AM Péter Váry <
>>>>>>>>> > > >>>>>>>>> peter.vary.apa...@gmail.com> wrote:
>>>>>>>>> > > >>>>>>>>>
>>>>>>>>> > > >>>>>>>>>> Hi Aihua,
>>>>>>>>> > > >>>>>>>>>> Long time no see :)
>>>>>>>>> > > >>>>>>>>>> Would this mean that every engine which plans to
>>>>>>>>> support the Variant
>>>>>>>>> > > >>>>>>>>>> data type needs to add Spark as a dependency? Like
>>>>>>>>> Flink/Trino/Hive, etc.?
>>>>>>>>> > > >>>>>>>>>> Thanks, Peter
>>>>>>>>> > > >>>>>>>>>>
>>>>>>>>> > > >>>>>>>>>>
>>>>>>>>> > > >>>>>>>>>> On Fri, Jul 12, 2024, 19:10 Aihua Xu <
>>>>>>>>> aihu...@apache.org> wrote:
>>>>>>>>> > > >>>>>>>>>>
>>>>>>>>> > > >>>>>>>>>>> Thanks Ryan.
>>>>>>>>> > > >>>>>>>>>>>
>>>>>>>>> > > >>>>>>>>>>> Yeah. That's another reason we want to pursue
>>>>>>>>> the Spark encoding to
>>>>>>>>> > > >>>>>>>>>>> keep compatibility across the open source engines.
>>>>>>>>> > > >>>>>>>>>>>
>>>>>>>>> > > >>>>>>>>>>> One more question regarding the encoding
>>>>>>>>> implementation: would there
>>>>>>>>> > > >>>>>>>>>>> be an issue with directly using the Spark
>>>>>>>>> implementation in Iceberg? Russell
>>>>>>>>> > > >>>>>>>>>>> pointed out that Trino doesn't have a Spark
>>>>>>>>> dependency, and that could be a
>>>>>>>>> > > >>>>>>>>>>> problem.
>>>>>>>>> > > >>>>>>>>>>>
>>>>>>>>> > > >>>>>>>>>>> Thanks,
>>>>>>>>> > > >>>>>>>>>>> Aihua
>>>>>>>>> > > >>>>>>>>>>>
>>>>>>>>> > > >>>>>>>>>>> On 2024/07/12 15:02:06 Ryan Blue wrote:
>>>>>>>>> > > >>>>>>>>>>> > Thanks, Aihua!
>>>>>>>>> > > >>>>>>>>>>> >
>>>>>>>>> > > >>>>>>>>>>> > I think that the encoding choice in the current
>>>>>>>>> doc is a good
>>>>>>>>> > > >>>>>>>>>>> one. I went
>>>>>>>>> > > >>>>>>>>>>> > through the Spark encoding in detail and it
>>>>>>>>> looks like a
>>>>>>>>> > > >>>>>>>>>>> better choice than
>>>>>>>>> > > >>>>>>>>>>> > the other candidate encodings for quickly
>>>>>>>>> accessing nested
>>>>>>>>> > > >>>>>>>>>>> fields.
>>>>>>>>> > > >>>>>>>>>>> >
>>>>>>>>> > > >>>>>>>>>>> > Another reason to use the Spark type is that
>>>>>>>>> this is what
>>>>>>>>> > > >>>>>>>>>>> Delta's variant
>>>>>>>>> > > >>>>>>>>>>> > type is based on, so Parquet files in tables
>>>>>>>>> written by Delta
>>>>>>>>> > > >>>>>>>>>>> could be
>>>>>>>>> > > >>>>>>>>>>> > converted or used in Iceberg tables without
>>>>>>>>> needing to rewrite
>>>>>>>>> > > >>>>>>>>>>> variant
>>>>>>>>> > > >>>>>>>>>>> > data. (Also, note that I work at Databricks and
>>>>>>>>> have an
>>>>>>>>> > > >>>>>>>>>>> interest in
>>>>>>>>> > > >>>>>>>>>>> > increasing format compatibility.)
>>>>>>>>> > > >>>>>>>>>>> >
>>>>>>>>> > > >>>>>>>>>>> > Ryan
>>>>>>>>> > > >>>>>>>>>>> >
>>>>>>>>> > > >>>>>>>>>>> > On Thu, Jul 11, 2024 at 11:21 AM Aihua Xu <
>>>>>>>>> > > >>>>>>>>>>> aihua...@snowflake.com.invalid>
>>>>>>>>> > > >>>>>>>>>>> > wrote:
>>>>>>>>> > > >>>>>>>>>>> >
>>>>>>>>> > > >>>>>>>>>>> > > [Discuss] Consensus for Variant Encoding
>>>>>>>>> > > >>>>>>>>>>> > >
>>>>>>>>> > > >>>>>>>>>>> > > It was great to be able to present the Variant
>>>>>>>>> type proposal
>>>>>>>>> > > >>>>>>>>>>> in the
>>>>>>>>> > > >>>>>>>>>>> > > community sync yesterday and I’m looking to
>>>>>>>>> host a meeting
>>>>>>>>> > > >>>>>>>>>>> next week
>>>>>>>>> > > >>>>>>>>>>> > > (targeting 9am, July 17th) to go over any
>>>>>>>>> further
>>>>>>>>> > > >>>>>>>>>>> concerns about the
>>>>>>>>> > > >>>>>>>>>>> > > encoding of the Variant type and any other
>>>>>>>>> questions on the
>>>>>>>>> > > >>>>>>>>>>> first phase of
>>>>>>>>> > > >>>>>>>>>>> > > the proposal
>>>>>>>>> > > >>>>>>>>>>> > > <
>>>>>>>>> > > >>>>>>>>>>>
>>>>>>>>> https://docs.google.com/document/d/1QjhpG_SVNPZh3anFcpicMQx90ebwjL7rmzFYfUP89Iw/edit
>>>>>>>>> > > >>>>>>>>>>> >.
>>>>>>>>> > > >>>>>>>>>>> > > We are hoping that anyone who is interested in
>>>>>>>>> the proposal
>>>>>>>>> > > >>>>>>>>>>> can either join
>>>>>>>>> > > >>>>>>>>>>> > > or reply with their comments so we can discuss
>>>>>>>>> them. Summary
>>>>>>>>> > > >>>>>>>>>>> of the
>>>>>>>>> > > >>>>>>>>>>> > > discussion and notes will be sent to the
>>>>>>>>> mailing list for
>>>>>>>>> > > >>>>>>>>>>> further comment
>>>>>>>>> > > >>>>>>>>>>> > > there.
>>>>>>>>> > > >>>>>>>>>>> > >
>>>>>>>>> > > >>>>>>>>>>> > >
>>>>>>>>> > > >>>>>>>>>>> > >    - What should be the underlying binary
>>>>>>>>> representation
>>>>>>>>> > > >>>>>>>>>>> > >
>>>>>>>>> > > >>>>>>>>>>> > > We have evaluated a few encodings in the doc
>>>>>>>>> including ION,
>>>>>>>>> > > >>>>>>>>>>> JSONB, and
>>>>>>>>> > > >>>>>>>>>>> > > the Spark encoding. Choosing the underlying
>>>>>>>>> encoding is an
>>>>>>>>> > > >>>>>>>>>>> important first step
>>>>>>>>> > > >>>>>>>>>>> > > here and we believe we have general support
>>>>>>>>> for Spark’s
>>>>>>>>> > > >>>>>>>>>>> Variant encoding.
>>>>>>>>> > > >>>>>>>>>>> > > We would like to hear if anyone else has
>>>>>>>>> strong opinions in
>>>>>>>>> > > >>>>>>>>>>> this space.
>>>>>>>>> > > >>>>>>>>>>> > >
>>>>>>>>> > > >>>>>>>>>>> > >
>>>>>>>>> > > >>>>>>>>>>> > >    - Should we support multiple logical types or
>>>>>>>>> just Variant?
>>>>>>>>> > > >>>>>>>>>>> Variant vs.
>>>>>>>>> > > >>>>>>>>>>> > >    Variant + JSON.
>>>>>>>>> > > >>>>>>>>>>> > >
>>>>>>>>> > > >>>>>>>>>>> > > This is to discuss what logical data type(s)
>>>>>>>>> to be supported
>>>>>>>>> > > >>>>>>>>>>> in Iceberg -
>>>>>>>>> > > >>>>>>>>>>> > > Variant only vs. Variant + JSON. Both types
>>>>>>>>> would share the
>>>>>>>>> > > >>>>>>>>>>> same underlying
>>>>>>>>> > > >>>>>>>>>>> > > encoding but would imply different limitations
>>>>>>>>> on engines
>>>>>>>>> > > >>>>>>>>>>> working with
>>>>>>>>> > > >>>>>>>>>>> > > those types.
>>>>>>>>> > > >>>>>>>>>>> > >
>>>>>>>>> > > >>>>>>>>>>> > > From the sync-up meeting, we are leaning
>>>>>>>>> toward
>>>>>>>>> > > >>>>>>>>>>> supporting Variant
>>>>>>>>> > > >>>>>>>>>>> > > only, and we want to reach consensus on the
>>>>>>>>> supported
>>>>>>>>> > > >>>>>>>>>>> type(s).
>>>>>>>>> > > >>>>>>>>>>> > >
>>>>>>>>> > > >>>>>>>>>>> > >
>>>>>>>>> > > >>>>>>>>>>> > >    - How should we move forward with
>>>>>>>>> Subcolumnization?
>>>>>>>>> > > >>>>>>>>>>> > >
>>>>>>>>> > > >>>>>>>>>>> > > Subcolumnization is an optimization for the
>>>>>>>>> Variant type that
>>>>>>>>> > > >>>>>>>>>>> separates out
>>>>>>>>> > > >>>>>>>>>>> > > subcolumns with their own metadata. This is
>>>>>>>>> not critical for
>>>>>>>>> > > >>>>>>>>>>> choosing the
>>>>>>>>> > > >>>>>>>>>>> > > initial encoding of the Variant type so we
>>>>>>>>> were hoping to
>>>>>>>>> > > >>>>>>>>>>> gain consensus on
>>>>>>>>> > > >>>>>>>>>>> leaving that for a follow-up spec.
>>>>>>>>> > > >>>>>>>>>>> > >
>>>>>>>>> > > >>>>>>>>>>> > >
>>>>>>>>> > > >>>>>>>>>>> > > Thanks
>>>>>>>>> > > >>>>>>>>>>> > >
>>>>>>>>> > > >>>>>>>>>>> > > Aihua
>>>>>>>>> > > >>>>>>>>>>> > >
>>>>>>>>> > > >>>>>>>>>>> > > Meeting invite:
>>>>>>>>> > > >>>>>>>>>>> > >
>>>>>>>>> > > >>>>>>>>>>> > > Wednesday, July 17 · 9:00 – 10:00am
>>>>>>>>> > > >>>>>>>>>>> > > Time zone: America/Los_Angeles
>>>>>>>>> > > >>>>>>>>>>> > > Google Meet joining info
>>>>>>>>> > > >>>>>>>>>>> > > Video call link:
>>>>>>>>> https://meet.google.com/pbm-ovzn-aoq
>>>>>>>>> > > >>>>>>>>>>> > > Or dial: (US) +1 650-449-9343 PIN: 170 576
>>>>>>>>> 525#
>>>>>>>>> > > >>>>>>>>>>> > > More phone numbers:
>>>>>>>>> > > >>>>>>>>>>> https://tel.meet/pbm-ovzn-aoq?pin=4079632691790
>>>>>>>>> > > >>>>>>>>>>> > >
>>>>>>>>> > > >>>>>>>>>>> > > On Tue, May 28, 2024 at 9:21 PM Aihua Xu <
>>>>>>>>> > > >>>>>>>>>>> aihua...@snowflake.com> wrote:
>>>>>>>>> > > >>>>>>>>>>> > >
>>>>>>>>> > > >>>>>>>>>>> > >> Hello,
>>>>>>>>> > > >>>>>>>>>>> > >>
>>>>>>>>> > > >>>>>>>>>>> > >> We have drafted the proposal
>>>>>>>>> > > >>>>>>>>>>> > >> <
>>>>>>>>> > > >>>>>>>>>>>
>>>>>>>>> https://docs.google.com/document/d/1QjhpG_SVNPZh3anFcpicMQx90ebwjL7rmzFYfUP89Iw/edit
>>>>>>>>> > > >>>>>>>>>>> >
>>>>>>>>> > > >>>>>>>>>>> > >> for Variant data type. Please help review and
>>>>>>>>> comment.
>>>>>>>>> > > >>>>>>>>>>> > >>
>>>>>>>>> > > >>>>>>>>>>> > >> Thanks,
>>>>>>>>> > > >>>>>>>>>>> > >> Aihua
>>>>>>>>> > > >>>>>>>>>>> > >>
>>>>>>>>> > > >>>>>>>>>>> > >> On Thu, May 16, 2024 at 12:45 PM Jack Ye <
>>>>>>>>> > > >>>>>>>>>>> yezhao...@gmail.com> wrote:
>>>>>>>>> > > >>>>>>>>>>> > >>
>>>>>>>>> > > >>>>>>>>>>> > >>> +10000 for a JSON/BSON type. We also had the
>>>>>>>>> same
>>>>>>>>> > > >>>>>>>>>>> discussion internally
>>>>>>>>> > > >>>>>>>>>>> > >>> and a JSON type would really play well with,
>>>>>>>>> for example,
>>>>>>>>> > > >>>>>>>>>>> the SUPER type in
>>>>>>>>> > > >>>>>>>>>>> > >>> Redshift:
>>>>>>>>> > > >>>>>>>>>>> > >>>
>>>>>>>>> > > >>>>>>>>>>>
>>>>>>>>> https://docs.aws.amazon.com/redshift/latest/dg/r_SUPER_type.html,
>>>>>>>>> > > >>>>>>>>>>> and
>>>>>>>>> > > >>>>>>>>>>> > >>> can also provide better integration with the
>>>>>>>>> Trino JSON
>>>>>>>>> > > >>>>>>>>>>> type.
>>>>>>>>> > > >>>>>>>>>>> > >>>
>>>>>>>>> > > >>>>>>>>>>> > >>> Looking forward to the proposal!
>>>>>>>>> > > >>>>>>>>>>> > >>>
>>>>>>>>> > > >>>>>>>>>>> > >>> Best,
>>>>>>>>> > > >>>>>>>>>>> > >>> Jack Ye
>>>>>>>>> > > >>>>>>>>>>> > >>>
>>>>>>>>> > > >>>>>>>>>>> > >>>
>>>>>>>>> > > >>>>>>>>>>> > >>> On Wed, May 15, 2024 at 9:37 AM Tyler Akidau
>>>>>>>>> > > >>>>>>>>>>> > >>> <tyler.aki...@snowflake.com.invalid> wrote:
>>>>>>>>> > > >>>>>>>>>>> > >>>
>>>>>>>>> > > >>>>>>>>>>> > >>>> On Tue, May 14, 2024 at 7:58 PM Gang Wu <
>>>>>>>>> ust...@gmail.com>
>>>>>>>>> > > >>>>>>>>>>> wrote:
>>>>>>>>> > > >>>>>>>>>>> > >>>>
>>>>>>>>> > > >>>>>>>>>>> > >>>>> > We may need some guidance on just how
>>>>>>>>> many we need to
>>>>>>>>> > > >>>>>>>>>>> look at;
>>>>>>>>> > > >>>>>>>>>>> > >>>>> > we were planning on Spark and Trino, but
>>>>>>>>> weren't sure
>>>>>>>>> > > >>>>>>>>>>> how much
>>>>>>>>> > > >>>>>>>>>>> > >>>>> > further down the rabbit hole we needed
>>>>>>>>> to go.
>>>>>>>>> > > >>>>>>>>>>> > >>>>>
>>>>>>>>> > > >>>>>>>>>>> > >>>>> There are some engines living outside the
>>>>>>>>> Java world. It
>>>>>>>>> > > >>>>>>>>>>> would be
>>>>>>>>> > > >>>>>>>>>>> > >>>>> good if the proposal could cover the
>>>>>>>>> effort it takes to
>>>>>>>>> > > >>>>>>>>>>> integrate
>>>>>>>>> > > >>>>>>>>>>> > >>>>> the variant type into them (e.g. Velox,
>>>>>>>>> DataFusion, etc.).
>>>>>>>>> > > >>>>>>>>>>> This is something
>>>>>>>>> > > >>>>>>>>>>> > >>>>> that
>>>>>>>>> > > >>>>>>>>>>> > >>>>> some proprietary iceberg vendors also care
>>>>>>>>> about.
>>>>>>>>> > > >>>>>>>>>>> > >>>>>
>>>>>>>>> > > >>>>>>>>>>> > >>>>
>>>>>>>>> > > >>>>>>>>>>> > >>>> Ack, makes sense. We can make sure to share
>>>>>>>>> some
>>>>>>>>> > > >>>>>>>>>>> perspective on this.
>>>>>>>>> > > >>>>>>>>>>> > >>>>
>>>>>>>>> > > >>>>>>>>>>> > >>>> > Not necessarily, no. As long as there's a
>>>>>>>>> binary type
>>>>>>>>> > > >>>>>>>>>>> and Iceberg and
>>>>>>>>> > > >>>>>>>>>>> > >>>>> > the query engines are aware that the
>>>>>>>>> binary column
>>>>>>>>> > > >>>>>>>>>>> needs to be
>>>>>>>>> > > >>>>>>>>>>> > >>>>> > interpreted as a variant, that should be
>>>>>>>>> sufficient.
>>>>>>>>> > > >>>>>>>>>>> > >>>>>
>>>>>>>>> > > >>>>>>>>>>> > >>>>> From the perspective of interoperability,
>>>>>>>>> it would be
>>>>>>>>> > > >>>>>>>>>>> good to support a
>>>>>>>>> > > >>>>>>>>>>> > >>>>> native
>>>>>>>>> > > >>>>>>>>>>> > >>>>> type in the file specs. Life will be easier
>>>>>>>>> for projects
>>>>>>>>> > > >>>>>>>>>>> like Apache
>>>>>>>>> > > >>>>>>>>>>> > >>>>> XTable.
>>>>>>>>> > > >>>>>>>>>>> > >>>>> File formats could also provide
>>>>>>>>> finer-grained statistics
>>>>>>>>> > > >>>>>>>>>>> for the variant
>>>>>>>>> > > >>>>>>>>>>> > >>>>> type, which
>>>>>>>>> > > >>>>>>>>>>> > >>>>> facilitates data skipping.
>>>>>>>>> > > >>>>>>>>>>> > >>>>>
>>>>>>>>> > > >>>>>>>>>>> > >>>>
>>>>>>>>> > > >>>>>>>>>>> > >>>> Agreed, there can definitely be additional
>>>>>>>>> value in
>>>>>>>>> > > >>>>>>>>>>> native file format
>>>>>>>>> > > >>>>>>>>>>> > >>>> integration. Just wanted to highlight that
>>>>>>>>> it's not a
>>>>>>>>> > > >>>>>>>>>>> strict requirement.
>>>>>>>>> > > >>>>>>>>>>> > >>>>
>>>>>>>>> > > >>>>>>>>>>> > >>>> -Tyler
>>>>>>>>> > > >>>>>>>>>>> > >>>>
>>>>>>>>> > > >>>>>>>>>>> > >>>>
>>>>>>>>> > > >>>>>>>>>>> > >>>>>
>>>>>>>>> > > >>>>>>>>>>> > >>>>> Gang
>>>>>>>>> > > >>>>>>>>>>> > >>>>>
>>>>>>>>> > > >>>>>>>>>>> > >>>>> On Wed, May 15, 2024 at 6:49 AM Tyler
>>>>>>>>> Akidau
>>>>>>>>> > > >>>>>>>>>>> > >>>>> <tyler.aki...@snowflake.com.invalid>
>>>>>>>>> wrote:
>>>>>>>>> > > >>>>>>>>>>> > >>>>>
>>>>>>>>> > > >>>>>>>>>>> > >>>>>> Good to see you again as well, JB! Thanks!
>>>>>>>>> > > >>>>>>>>>>> > >>>>>>
>>>>>>>>> > > >>>>>>>>>>> > >>>>>> -Tyler
>>>>>>>>> > > >>>>>>>>>>> > >>>>>>
>>>>>>>>> > > >>>>>>>>>>> > >>>>>>
>>>>>>>>> > > >>>>>>>>>>> > >>>>>> On Tue, May 14, 2024 at 1:04 PM
>>>>>>>>> Jean-Baptiste Onofré <
>>>>>>>>> > > >>>>>>>>>>> j...@nanthrax.net>
>>>>>>>>> > > >>>>>>>>>>> > >>>>>> wrote:
>>>>>>>>> > > >>>>>>>>>>> > >>>>>>
>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> Hi Tyler,
>>>>>>>>> > > >>>>>>>>>>> > >>>>>>>
>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> Super happy to see you there :) It
>>>>>>>>> reminds me our
>>>>>>>>> > > >>>>>>>>>>> discussions back in
>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> the start of Apache Beam :)
>>>>>>>>> > > >>>>>>>>>>> > >>>>>>>
>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> Anyway, the thread is pretty
>>>>>>>>> interesting. I remember
>>>>>>>>> > > >>>>>>>>>>> some discussions
>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> about JSON datatype for spec v3. The
>>>>>>>>> binary data type
>>>>>>>>> > > >>>>>>>>>>> is already
>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> supported in the spec v2.
>>>>>>>>> > > >>>>>>>>>>> > >>>>>>>
>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> I'm looking forward to the proposal and
>>>>>>>>> happy to help
>>>>>>>>> > > >>>>>>>>>>> on this !
>>>>>>>>> > > >>>>>>>>>>> > >>>>>>>
>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> Regards
>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> JB
>>>>>>>>> > > >>>>>>>>>>> > >>>>>>>
>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> On Sat, May 11, 2024 at 7:06 AM Tyler
>>>>>>>>> Akidau
>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> <tyler.aki...@snowflake.com.invalid>
>>>>>>>>> wrote:
>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> >
>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> > Hello,
>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> >
>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> > We (Tyler, Nileema, Selcuk, Aihua) are
>>>>>>>>> working on a
>>>>>>>>> > > >>>>>>>>>>> proposal for
>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> which we’d like to get early feedback
>>>>>>>>> from the
>>>>>>>>> > > >>>>>>>>>>> community. As you may know,
>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> Snowflake has embraced Iceberg as its
>>>>>>>>> open Data Lake
>>>>>>>>> > > >>>>>>>>>>> format. Having made
>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> good progress on our own adoption of the
>>>>>>>>> Iceberg
>>>>>>>>> > > >>>>>>>>>>> standard, we’re now in a
>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> position where there are features not
>>>>>>>>> yet supported in
>>>>>>>>> > > >>>>>>>>>>> Iceberg which we
>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> think would be valuable for our users,
>>>>>>>>> and that we
>>>>>>>>> > > >>>>>>>>>>> would like to discuss
>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> with and help contribute to the Iceberg
>>>>>>>>> community.
>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> >
>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> > The first two such features we’d like
>>>>>>>>> to discuss are
>>>>>>>>> > > >>>>>>>>>>> in support of
>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> efficient querying of dynamically typed,
>>>>>>>>> > > >>>>>>>>>>> semi-structured data: variant data
>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> types, and subcolumnarization of variant
>>>>>>>>> columns. In
>>>>>>>>> > > >>>>>>>>>>> more detail, for
>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> anyone who may not already be familiar:
>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> >
>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> > 1. Variant data types
>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> > Variant types allow for the efficient
>>>>>>>>> binary
>>>>>>>>> > > >>>>>>>>>>> encoding of dynamic
>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> semi-structured data such as JSON, Avro,
>>>>>>>>> etc. By
>>>>>>>>> > > >>>>>>>>>>> encoding semi-structured
>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> data as a variant column, we retain the
>>>>>>>>> flexibility of
>>>>>>>>> > > >>>>>>>>>>> the source data,
>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> while allowing query engines to more
>>>>>>>>> efficiently
>>>>>>>>> > > >>>>>>>>>>> operate on the data.
>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> Snowflake has supported the variant data
>>>>>>>>> type on
>>>>>>>>> > > >>>>>>>>>>> Snowflake tables for many
>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> years [1]. As more and more users
>>>>>>>>> utilize Iceberg
>>>>>>>>> > > >>>>>>>>>>> tables in Snowflake,
>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> we’re hearing an increasing chorus of
>>>>>>>>> requests for
>>>>>>>>> > > >>>>>>>>>>> variant support.
>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> Additionally, other query engines such
>>>>>>>>> as Apache Spark
>>>>>>>>> > > >>>>>>>>>>> have begun adding
>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> variant support [2]. As such, we believe
>>>>>>>>> it would be
>>>>>>>>> > > >>>>>>>>>>> beneficial to the
>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> Iceberg community as a whole to
>>>>>>>>> standardize on the
>>>>>>>>> > > >>>>>>>>>>> variant data type
>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> encoding used across Iceberg tables.
>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> >
>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> > One specific point to make here is
>>>>>>>>> that, since an
>>>>>>>>> > > >>>>>>>>>>> Apache OSS
>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> version of variant encoding already
>>>>>>>>> exists in Spark,
>>>>>>>>> > > >>>>>>>>>>> it likely makes sense
>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> to simply adopt the Spark encoding as
>>>>>>>>> the Iceberg
>>>>>>>>> > > >>>>>>>>>>> standard as well. The
>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> encoding we use internally today in
>>>>>>>>> Snowflake is
>>>>>>>>> > > >>>>>>>>>>> slightly different, but
>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> essentially equivalent, and we see no
>>>>>>>>> particular value
>>>>>>>>> > > >>>>>>>>>>> in trying to clutter
>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> the space with another
>>>>>>>>> equivalent-but-incompatible
>>>>>>>>> > > >>>>>>>>>>> encoding.
>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> >
>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> >
>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> > 2. Subcolumnarization
>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> > Subcolumnarization of variant columns
>>>>>>>>> allows query
>>>>>>>>> > > >>>>>>>>>>> engines to
>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> efficiently prune datasets when
>>>>>>>>> subcolumns (i.e.,
>>>>>>>>> > > >>>>>>>>>>> nested fields) within a
>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> variant column are queried, and also
>>>>>>>>> allows optionally
>>>>>>>>> > > >>>>>>>>>>> materializing some
>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> of the nested fields as a column on
>>>>>>>>> their own,
>>>>>>>>> > > >>>>>>>>>>> affording queries on these
>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> subcolumns the ability to read less data
>>>>>>>>> and spend
>>>>>>>>> > > >>>>>>>>>>> less CPU on extraction.
>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> When subcolumnarizing, the system
>>>>>>>>> managing table
>>>>>>>>> > > >>>>>>>>>>> metadata and data tracks
>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> individual pruning statistics (min, max,
>>>>>>>>> null, etc.)
>>>>>>>>> > > >>>>>>>>>>> for some subset of the
>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> nested fields within a variant, and also
>>>>>>>>> manages any
>>>>>>>>> > > >>>>>>>>>>> optional
>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> materialization. Without
>>>>>>>>> subcolumnarization, any query
>>>>>>>>> > > >>>>>>>>>>> which touches a
>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> variant column must read, parse,
>>>>>>>>> extract, and filter
>>>>>>>>> > > >>>>>>>>>>> every row for which
>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> that column is non-null. Thus, by
>>>>>>>>> providing a
>>>>>>>>> > > >>>>>>>>>>> standardized way of tracking
>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> subcolumn metadata and data for variant
>>>>>>>>> columns,
>>>>>>>>> > > >>>>>>>>>>> Iceberg can make
>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> subcolumnar optimizations accessible
>>>>>>>>> across various
>>>>>>>>> > > >>>>>>>>>>> catalogs and query
>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> engines.
>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> >
>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> > Subcolumnarization is a non-trivial
>>>>>>>>> topic, so we
>>>>>>>>> > > >>>>>>>>>>> expect any
>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> concrete proposal to include not only
>>>>>>>>> the set of
>>>>>>>>> > > >>>>>>>>>>> changes to Iceberg
>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> metadata that allow compatible query
>>>>>>>>> engines to
>>>>>>>>> > > >>>>>>>>>>> interoperate on
>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> subcolumnarization data for variant
>>>>>>>>> columns, but also
>>>>>>>>> > > >>>>>>>>>>> reference
>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> documentation explaining
>>>>>>>>> subcolumnarization principles
>>>>>>>>> > > >>>>>>>>>>> and recommended best
>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> practices.
>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> >
>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> >
>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> > It sounds like the recent Geo proposal
>>>>>>>>> [3] may be a
>>>>>>>>> > > >>>>>>>>>>> good starting
>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> point for how to approach this, so our
>>>>>>>>> plan is to
>>>>>>>>> > > >>>>>>>>>>> write something up in
>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> that vein that covers the proposed spec
>>>>>>>>> changes,
>>>>>>>>> > > >>>>>>>>>>> backwards compatibility,
>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> implementor burdens, etc. But we wanted
>>>>>>>>> to first reach
>>>>>>>>> > > >>>>>>>>>>> out to the community
>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> to introduce ourselves and the idea, and
>>>>>>>>> see if
>>>>>>>>> > > >>>>>>>>>>> there’s any early feedback
>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> we should incorporate before we spend
>>>>>>>>> too much time on
>>>>>>>>> > > >>>>>>>>>>> a concrete proposal.
>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> >
>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> > Thank you!
>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> >
>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> > [1]
>>>>>>>>> > > >>>>>>>>>>> > >>>>>>>
>>>>>>>>> > > >>>>>>>>>>>
>>>>>>>>> https://docs.snowflake.com/en/sql-reference/data-types-semistructured
>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> > [2]
>>>>>>>>> > > >>>>>>>>>>> > >>>>>>>
>>>>>>>>> > > >>>>>>>>>>>
>>>>>>>>> https://github.com/apache/spark/blob/master/common/variant/README.md
>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> > [3]
>>>>>>>>> > > >>>>>>>>>>> > >>>>>>>
>>>>>>>>> > > >>>>>>>>>>>
>>>>>>>>> https://docs.google.com/document/d/1iVFbrRNEzZl8tDcZC81GFt01QJkLJsI9E2NBOt21IRI/edit
>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> >
>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> > -Tyler, Nileema, Selcuk, Aihua
>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> >
>>>>>>>>> > > >>>>>>>>>>> > >>>>>>>
>>>>>>>>> > > >>>>>>>>>>> > >>>>>>
>>>>>>>>> > > >>>>>>>>>>> >
>>>>>>>>> > > >>>>>>>>>>> > --
>>>>>>>>> > > >>>>>>>>>>> > Ryan Blue
>>>>>>>>> > > >>>>>>>>>>> > Databricks
>>>>>>>>> > > >>>>>>>>>>> >
>>>>>>>>> > > >>>>>>>>>>>
>>>>>>>>> > > >>>>>>>>>>
>>>>>>>>> > > >>>>>>>>>
>>>>>>>>> > > >>>>>>>>> --
>>>>>>>>> > > >>>>>>>>> Ryan Blue
>>>>>>>>> > > >>>>>>>>> Databricks
>>>>>>>>> > > >>>>>>>>>
>>>>>>>>> > > >>>>>>>>
>>>>>>>>> > > >>>>>>>>
>>>>>>>>> > > >>>>>>>> --
>>>>>>>>> > > >>>>>>>> Ryan Blue
>>>>>>>>> > > >>>>>>>> Databricks
>>>>>>>>> > > >>>>>>>>
>>>>>>>>> > > >>>>>>>
>>>>>>>>> > > >>>>>>
>>>>>>>>> > > >>>>>> --
>>>>>>>>> > > >>>>>> Ryan Blue
>>>>>>>>> > > >>>>>> Databricks
>>>>>>>>> > > >>>>>>
>>>>>>>>> > > >>>>>
>>>>>>>>> > > >>>>
>>>>>>>>> > > >>>> --
>>>>>>>>> > > >>>> Ryan Blue
>>>>>>>>> > > >>>> Databricks
>>>>>>>>> > > >>>>
>>>>>>>>> > > >>>
>>>>>>>>> > > >>
>>>>>>>>> > > >> --
>>>>>>>>> > > >> Ryan Blue
>>>>>>>>> > > >> Databricks
>>>>>>>>> > > >>
>>>>>>>>> > > >
>>>>>>>>> > >
>>>>>>>>> >
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Ryan Blue
>>>>>>> Databricks
>>>>>>>
>>>>>>
