Hi community, Thanks for joining the meeting to discuss variant shredding. For those who were unable to attend, the recording is available here: <https://drive.google.com/file/d/1kiwv29nxxOqMCbxXn-NRoz-x2E9yIMlJ/view?usp=drive_link>. Also, to follow up on the meeting and converge offline on the lossiness discussion for shredding, I have converted David's Spark shredding proposal into a Google doc <https://docs.google.com/document/d/1JeBt4NIju08jQ2AbludiK-U0M9ISIgysP7fUDWtv7rg/edit>; please comment there.
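For anyone newer to the shredding discussion, here is a rough, purely illustrative sketch of what subcolumnarization does: frequently queried paths are lifted out of the variant column into their own typed subcolumns, and whatever is left stays in an untyped residual. The function and layout below are made up for illustration; this is not the Spark shredding spec.

```python
# Illustrative sketch of shredding ("subcolumnarization"), NOT the Spark spec:
# selected paths become typed subcolumns; the rest stays a residual "variant".
import json

def shred(records, typed_paths):
    """Split JSON objects into typed subcolumns plus a residual variant."""
    columns = {path: [] for path in typed_paths}
    residuals = []
    for rec in records:
        rest = dict(rec)
        for path in typed_paths:
            # Missing fields become nulls in the typed subcolumn.
            columns[path].append(rest.pop(path, None))
        # Everything not shredded out stays in the residual.
        residuals.append(json.dumps(rest))
    return columns, residuals

records = [
    {"id": 1, "price": 9.99, "tags": ["a"]},
    {"id": 2, "note": "no price here"},
]
columns, residuals = shred(records, ["id", "price"])
print(columns["price"])  # typed subcolumn, readable without parsing JSON
print(residuals[1])      # leftover fields remain in the residual
```

The typed subcolumns can then carry their own statistics for pruning, while the residual keeps the variant's flexibility; the lossiness question in the doc is about what may be dropped or re-encoded when values move between the two forms.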
Thanks, Aihua On Thu, Jul 25, 2024 at 10:14 AM Aihua Xu <aihu...@gmail.com> wrote: > Yes. This time I was able to record it and I will share it when it’s > processed. > > > On Jul 25, 2024, at 10:01 AM, Amogh Jahagirdar <2am...@gmail.com> wrote: > > > Any chance this meeting was recorded? I couldn't make it but would be > interested in catching up on the discussion. > > Thanks, > > Amogh Jahagirdar > > On Tue, Jul 23, 2024 at 11:30 AM Aihua Xu <aihu...@gmail.com> wrote: > >> Thanks folks for the additional discussion. >> >> There are some questions related to subcolumnarization (Spark shredding - >> see the discussion <https://github.com/apache/spark/pull/46831>) and we >> would like to host another meeting mainly to discuss that, since we plan to >> adopt it. We can also follow up on the Spark variant topics (I can see that >> we are mostly aligned, except for finding a home for the spec and >> implementation). Looking forward to meeting with you. BTW: should I include >> dev@iceberg.apache.org in the email invite? >> >> Sync up on Variant subcolumnarization (shredding) >> Thursday, July 25 · 8:00 – 9:00am >> Time zone: America/Los_Angeles >> Google Meet joining info >> Video call link: https://meet.google.com/mug-dvnv-hnq >> Or dial: (US) +1 904-900-0730 PIN: 671 997 419# >> More phone numbers: https://tel.meet/mug-dvnv-hnq?pin=1654043233422 >> >> Thanks, >> Aihua >> >> On Tue, Jul 23, 2024 at 6:36 AM Amogh Jahagirdar <2am...@gmail.com> >> wrote: >> >>> I'm late replying to this but I'm also in agreement with 1 (adopting the >>> Spark variant encoding), 3 (specifically only having a variant type), and 4 >>> (ensuring we are thinking through subcolumnarization upfront since without >>> it the variant type may not be that useful). 
>>> >>> I'd also support having the spec and reference implementation in >>> Iceberg; as others have said, it centralizes improvements in a single, >>> agnostic dependency for engines, rather than engines having to take >>> dependencies on other engine modules. >>> >>> Thanks, >>> >>> Amogh Jahagirdar >>> >>> On Tue, Jul 23, 2024 at 12:15 AM Péter Váry <peter.vary.apa...@gmail.com> >>> wrote: >>> >>>> I have been looking around at how we can map the Variant type in Flink. I >>>> have not found any existing type which we could use, but Flink already has >>>> some JSON parsing capabilities [1] for string fields. >>>> >>>> So until we have native support in Flink for something similar to the >>>> Variant type, I expect that we need to map it to JSON strings in RowData. >>>> >>>> Based on that, here are my preferences: >>>> 1. I'm OK with adopting the Spark Variant type if we build our own Iceberg >>>> serializer/deserializer module for it >>>> 2. I prefer to move the spec to Iceberg, so we own it and can extend it >>>> if needed. This could be important in the first phase. Later, when it is >>>> more stable, we might donate it to some other project, like Parquet >>>> 3. I would prefer to support only a single type, and Variant is more >>>> expressive, but having a standard way to convert between JSON and Variant >>>> would be useful for Flink users. >>>> 4. On subcolumnarization: I think Flink will only use this feature as >>>> much as the Iceberg readers implement it, so I would like to see as much >>>> as possible of it in the common Iceberg code. >>>> >>>> Thanks, >>>> Peter >>>> >>>> [1] - >>>> https://nightlies.apache.org/flink/flink-docs-release-1.19/docs/dev/table/functions/systemfunctions/#json-functions >>>> >>>> >>>> On Tue, Jul 23, 2024, 06:36 Micah Kornfield <emkornfi...@gmail.com> >>>> wrote: >>>> >>>>> Sorry for the late reply. I agree with the sentiments on 1 and 3 that >>>>> have already been posted (adopt the Spark encoding, and only have the >>>>> Variant type). 
As mentioned on the doc for 3, I think it would be good to >>>>> specify how to map scalar types to a JSON representation so there can be >>>>> consistency between engines that don't support variant. >>>>> >>>>> >>>>>> Regarding point 2, I also feel Iceberg is more natural to host such a >>>>>> subproject for variant spec and implementation. But let me reach out to >>>>>> the >>>>>> Spark community to discuss. >>>>> >>>>> >>>>> The only other place I can think of that might be a good home for >>>>> Variant spec could be in Apache Arrow as a canonical extension type. There >>>>> is an issue for this [1]. I think the main thing on where this is housed >>>>> is which types are intended to be supported. I believe Arrow is currently >>>>> a superset of the Iceberg type system (UUID is supported as a canonical >>>>> extension type [2]). >>>>> >>>>> For point 4 subcolumnarization, I think ideally this belongs in >>>>> Iceberg (and if Iceberg and Delta Lake can agree on how to do it that >>>>> would >>>>> be great) with potential consultation with Parquet/ORC communities to >>>>> potentially add better native support. >>>>> >>>>> Thanks, >>>>> Micah >>>>> >>>>> >>>>> >>>>> [1] https://github.com/apache/arrow/issues/42069 >>>>> [2] https://arrow.apache.org/docs/format/CanonicalExtensions.html >>>>> >>>>> On Sat, Jul 20, 2024 at 5:54 PM Aihua Xu <aihu...@gmail.com> wrote: >>>>> >>>>>> Thanks for the discussion and feedback. >>>>>> >>>>>> Do we have the consensus on point 1 and point 3 to move forward with >>>>>> Spark variant encoding and support Variant type only? Or let me know how >>>>>> to >>>>>> proceed from here. >>>>>> >>>>>> Regarding point 2, I also feel Iceberg is more natural to host such a >>>>>> subproject for variant spec and implementation. But let me reach out to >>>>>> the >>>>>> Spark community to discuss. 
>>>>>> >>>>>> Thanks, >>>>>> Aihua >>>>>> >>>>>> >>>>>> On Fri, Jul 19, 2024 at 9:35 AM Yufei Gu <flyrain...@gmail.com> >>>>>> wrote: >>>>>> >>>>>>> Agreed with point 1. >>>>>>> >>>>>>> For point 2, I also prefer to hold the spec and reference >>>>>>> implementation under Iceberg. Here are the reasons: >>>>>>> 1. It is unconventional and impractical for one engine to depend on >>>>>>> another for data types. For instance, it is not ideal for Trino to rely >>>>>>> on >>>>>>> data types defined by the Spark engine. >>>>>>> 2. Iceberg serves as a bridge between engines and file formats. By >>>>>>> centralizing the specification in Iceberg, any future optimizations or >>>>>>> updates to file formats can be referred to within Iceberg, ensuring >>>>>>> consistency and reducing dependencies. >>>>>>> >>>>>>> For point 3, I'd prefer to support the variant type only at this >>>>>>> moment. >>>>>>> >>>>>>> Yufei >>>>>>> >>>>>>> >>>>>>> On Thu, Jul 18, 2024 at 12:55 PM Ryan Blue >>>>>>> <b...@databricks.com.invalid> wrote: >>>>>>> >>>>>>>> Similarly, I'm aligned with point 1 and I'd choose to support only >>>>>>>> variant for point 3. >>>>>>>> >>>>>>>> We'll need to work with the Spark community to find a good place >>>>>>>> for the library and spec, since it touches many different projects. I'd >>>>>>>> also prefer Iceberg as the home. >>>>>>>> >>>>>>>> I also think it's a good idea to get subcolumnarization into our >>>>>>>> spec when we update. Without that I think the feature will be fairly >>>>>>>> limited. >>>>>>>> >>>>>>>> On Thu, Jul 18, 2024 at 10:56 AM Russell Spitzer < >>>>>>>> russell.spit...@gmail.com> wrote: >>>>>>>> >>>>>>>>> I'm aligned with point 1. >>>>>>>>> >>>>>>>>> For point 2 I think we should choose quickly, I honestly do think >>>>>>>>> this would be fine as part of the Iceberg Spec directly but >>>>>>>>> understand it >>>>>>>>> may be better for the more broad community if it was a sub project. 
>>>>>>>>> As a >>>>>>>>> sub-project I would still prefer it being an Iceberg subproject, since >>>>>>>>> we >>>>>>>>> are engine/file-format agnostic. >>>>>>>>> >>>>>>>>> 3. I support adding just Variant. >>>>>>>>> >>>>>>>>> On Thu, Jul 18, 2024 at 12:54 AM Aihua Xu <aihu...@apache.org> >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>>> Hello community, >>>>>>>>>> >>>>>>>>>> It’s great to sync up with some of you on Variant and >>>>>>>>>> Subcolumnarization support in Iceberg again. Apologies that I didn’t >>>>>>>>>> record >>>>>>>>>> the meeting, but here are some key items that we want to follow up >>>>>>>>>> on with the >>>>>>>>>> community. >>>>>>>>>> >>>>>>>>>> 1. Adopt Spark Variant encoding >>>>>>>>>> Those present were in favor of adopting the Spark variant >>>>>>>>>> encoding for Iceberg Variant, with extensions to support other Iceberg >>>>>>>>>> types. We would like to know if anyone has an objection to >>>>>>>>>> reusing an >>>>>>>>>> open source encoding. >>>>>>>>>> >>>>>>>>>> 2. Movement of the Spark Variant Spec to another project >>>>>>>>>> To avoid introducing Apache Spark as a dependency for the engines >>>>>>>>>> and file formats, we discussed separating the Spark Variant encoding >>>>>>>>>> spec and >>>>>>>>>> implementation from the Spark project to a neutral location. We >>>>>>>>>> thought up >>>>>>>>>> several solutions but didn’t have consensus on any of them. We are >>>>>>>>>> looking >>>>>>>>>> for more feedback on this topic from the community, either in terms of >>>>>>>>>> support for one of these options or another idea on how to support >>>>>>>>>> the spec. 
>>>>>>>>>> >>>>>>>>>> Options Proposed: >>>>>>>>>> * Leave the spec in Spark (difficult for versioning and other >>>>>>>>>> engines) >>>>>>>>>> * Copy the spec into the Iceberg project directly (difficult for >>>>>>>>>> other table formats) >>>>>>>>>> * Create a sub-project of Apache Iceberg and move the spec >>>>>>>>>> and reference implementation there (logistically complicated) >>>>>>>>>> * Create a sub-project of Apache Spark and move the spec and >>>>>>>>>> reference implementation there (logistically complicated) >>>>>>>>>> >>>>>>>>>> 3. Add Variant type vs. Variant and JSON types >>>>>>>>>> Those who were present were in favor of adding only the Variant >>>>>>>>>> type to Iceberg. We are looking for anyone who has an objection to >>>>>>>>>> going >>>>>>>>>> forward with just the Variant type and no Iceberg JSON type. We were >>>>>>>>>> favoring adding the Variant type only because: >>>>>>>>>> * Introducing a JSON type would require engines that only support >>>>>>>>>> VARIANT to do write-time validation of their input to a JSON column. >>>>>>>>>> An >>>>>>>>>> engine without a JSON type wouldn’t be able to support this. >>>>>>>>>> * Engines which don’t support Variant will work most of the time >>>>>>>>>> and can use fallback strings defined in the spec for reading >>>>>>>>>> unsupported >>>>>>>>>> types. Writing JSON into a Variant will always work. >>>>>>>>>> >>>>>>>>>> 4. Support for Subcolumnarization spec (shredding in Spark) >>>>>>>>>> We have no action items on this but would like to follow up on >>>>>>>>>> discussions on subcolumnarization in the future. >>>>>>>>>> * We had general agreement that this should be included in >>>>>>>>>> Iceberg V3, or else adding Variant may not be useful. >>>>>>>>>> * We are also interested in adopting the shredding spec from >>>>>>>>>> Spark and would like to move it to wherever we decide the >>>>>>>>>> Variant >>>>>>>>>> spec will live. 
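To make point 3 concrete: any JSON value can be written into a Variant unchanged, but rendering variant-only scalar types back as JSON needs spec-defined fallback strings. The sketch below illustrates that asymmetry; the fallback encodings chosen here (base64 for binary, ISO-8601 for timestamps) are illustrative assumptions, not taken from any spec.

```python
# Hedged sketch of the JSON fallback idea: JSON -> Variant always works, but
# Variant -> JSON needs fallback strings for scalars JSON lacks. The specific
# fallback encodings below are illustrative, not from any spec.
import base64
import datetime
import json

def variant_to_json(value):
    """Render a variant-like Python value as JSON text."""
    def fallback(v):
        if isinstance(v, bytes):              # JSON has no binary type
            return base64.b64encode(v).decode("ascii")
        if isinstance(v, datetime.datetime):  # JSON has no timestamp type
            return v.isoformat()
        raise TypeError(f"unsupported variant scalar: {type(v)!r}")
    return json.dumps(value, default=fallback)

# Plain JSON values round-trip untouched; variant-only scalars degrade to strings.
print(variant_to_json({"ok": True, "n": 1.5}))
print(variant_to_json({"blob": b"\x00\x01"}))  # binary becomes "AAE="
```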
>>>>>>>>>> >>>>>>>>>> Let us know if we missed anything and if you have any additional >>>>>>>>>> thoughts or suggestions. >>>>>>>>>> >>>>>>>>>> Thanks >>>>>>>>>> Aihua >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On 2024/07/15 18:32:22 Aihua Xu wrote: >>>>>>>>>> > Thanks for the discussion. >>>>>>>>>> > >>>>>>>>>> > I will move forward to work on the spec PR. >>>>>>>>>> > >>>>>>>>>> > Regarding the implementation, we will have a module for Variant >>>>>>>>>> support in Iceberg, so we will not have to bring in Spark libraries. >>>>>>>>>> > >>>>>>>>>> > I'm reposting the meeting invite in case it's not clear in my >>>>>>>>>> original email, since I included it at the end. Looks like we don't have >>>>>>>>>> major >>>>>>>>>> objections/divergences, but let's sync up and reach consensus. >>>>>>>>>> > >>>>>>>>>> > Meeting invite: >>>>>>>>>> > >>>>>>>>>> > Wednesday, July 17 · 9:00 – 10:00am >>>>>>>>>> > Time zone: America/Los_Angeles >>>>>>>>>> > Google Meet joining info >>>>>>>>>> > Video call link: https://meet.google.com/pbm-ovzn-aoq >>>>>>>>>> > Or dial: (US) +1 650-449-9343 PIN: 170 576 525# >>>>>>>>>> > More phone numbers: >>>>>>>>>> https://tel.meet/pbm-ovzn-aoq?pin=4079632691790 >>>>>>>>>> > >>>>>>>>>> > Thanks, >>>>>>>>>> > Aihua >>>>>>>>>> > >>>>>>>>>> > On 2024/07/12 20:55:01 Micah Kornfield wrote: >>>>>>>>>> > > I don't think this needs to hold up the PR, but I think coming >>>>>>>>>> to a >>>>>>>>>> > > consensus on the exact set of types supported is worthwhile >>>>>>>>>> (and whether the >>>>>>>>>> > > goal is to maintain the same set as specified by the Spark >>>>>>>>>> Variant type or >>>>>>>>>> > > if divergence is expected/allowed). From a fragmentation >>>>>>>>>> perspective it >>>>>>>>>> > > would be a shame if they diverge, so maybe a next step is >>>>>>>>>> also proposing that >>>>>>>>>> > > the Spark community support the missing existing >>>>>>>>>> Iceberg types? 
>>>>>>>>>> > > >>>>>>>>>> > > Thanks, >>>>>>>>>> > > Micah >>>>>>>>>> > > >>>>>>>>>> > > On Fri, Jul 12, 2024 at 1:44 PM Russell Spitzer < >>>>>>>>>> russell.spit...@gmail.com> >>>>>>>>>> > > wrote: >>>>>>>>>> > > >>>>>>>>>> > > > Just talked with Aihua and he's working on the Spec PR now. >>>>>>>>>> We can get >>>>>>>>>> > > > feedback there from everyone. >>>>>>>>>> > > > >>>>>>>>>> > > > On Fri, Jul 12, 2024 at 3:41 PM Ryan Blue >>>>>>>>>> <b...@databricks.com.invalid> >>>>>>>>>> > > > wrote: >>>>>>>>>> > > > >>>>>>>>>> > > >> Good idea, but I'm hoping that we can continue to get >>>>>>>>>> their feedback in >>>>>>>>>> > > >> parallel to getting the spec changes started. Piotr didn't >>>>>>>>>> seem to object >>>>>>>>>> > > >> to the encoding from what I read of his comments. >>>>>>>>>> Hopefully he (and others) >>>>>>>>>> > > >> chime in here. >>>>>>>>>> > > >> >>>>>>>>>> > > >> On Fri, Jul 12, 2024 at 1:32 PM Russell Spitzer < >>>>>>>>>> > > >> russell.spit...@gmail.com> wrote: >>>>>>>>>> > > >> >>>>>>>>>> > > >>> I just want to make sure we get Piotr and Peter on board >>>>>>>>>> as >>>>>>>>>> > > >>> representatives of Flink and Trino engines. Also make >>>>>>>>>> sure we have anyone >>>>>>>>>> > > >>> else chime in who has experience with Ray if possible. >>>>>>>>>> > > >>> >>>>>>>>>> > > >>> Spec changes feel like the right next step. >>>>>>>>>> > > >>> >>>>>>>>>> > > >>> On Fri, Jul 12, 2024 at 3:14 PM Ryan Blue >>>>>>>>>> <b...@databricks.com.invalid> >>>>>>>>>> > > >>> wrote: >>>>>>>>>> > > >>> >>>>>>>>>> > > >>>> Okay, what are the next steps here? This proposal has >>>>>>>>>> been out for >>>>>>>>>> > > >>>> quite a while and I don't see any major objections to >>>>>>>>>> using the Spark >>>>>>>>>> > > >>>> encoding. It's quite well designed and fits the need >>>>>>>>>> well. It can also be >>>>>>>>>> > > >>>> extended to support additional types that are missing if >>>>>>>>>> that's a priority. 
>>>>>>>>>> > > >>>> >>>>>>>>>> > > >>>> Should we move forward by starting a draft of the >>>>>>>>>> changes to the table >>>>>>>>>> > > >>>> spec? Then we can vote on committing those changes and >>>>>>>>>> get moving on an >>>>>>>>>> > > >>>> implementation (or possibly do the implementation in >>>>>>>>>> parallel). >>>>>>>>>> > > >>>> >>>>>>>>>> > > >>>> On Fri, Jul 12, 2024 at 1:08 PM Russell Spitzer < >>>>>>>>>> > > >>>> russell.spit...@gmail.com> wrote: >>>>>>>>>> > > >>>> >>>>>>>>>> > > >>>>> That's fair, I'm sold on an Iceberg Module. >>>>>>>>>> > > >>>>> >>>>>>>>>> > > >>>>> On Fri, Jul 12, 2024 at 2:53 PM Ryan Blue >>>>>>>>>> <b...@databricks.com.invalid> >>>>>>>>>> > > >>>>> wrote: >>>>>>>>>> > > >>>>> >>>>>>>>>> > > >>>>>> > Feels like eventually the encoding should land in >>>>>>>>>> parquet proper >>>>>>>>>> > > >>>>>> right? >>>>>>>>>> > > >>>>>> >>>>>>>>>> > > >>>>>> What about using it in ORC? I don't know where it >>>>>>>>>> should end up. >>>>>>>>>> > > >>>>>> Maybe Iceberg should make a standalone module from it? >>>>>>>>>> > > >>>>>> >>>>>>>>>> > > >>>>>> On Fri, Jul 12, 2024 at 12:38 PM Russell Spitzer < >>>>>>>>>> > > >>>>>> russell.spit...@gmail.com> wrote: >>>>>>>>>> > > >>>>>> >>>>>>>>>> > > >>>>>>> Feels like eventually the encoding should land in >>>>>>>>>> parquet proper >>>>>>>>>> > > >>>>>>> right? I'm fine with us just copying into Iceberg >>>>>>>>>> though for the time >>>>>>>>>> > > >>>>>>> being. >>>>>>>>>> > > >>>>>>> >>>>>>>>>> > > >>>>>>> On Fri, Jul 12, 2024 at 2:31 PM Ryan Blue >>>>>>>>>> > > >>>>>>> <b...@databricks.com.invalid> wrote: >>>>>>>>>> > > >>>>>>> >>>>>>>>>> > > >>>>>>>> Oops, it looks like I missed where Aihua brought >>>>>>>>>> this up in his >>>>>>>>>> > > >>>>>>>> last email: >>>>>>>>>> > > >>>>>>>> >>>>>>>>>> > > >>>>>>>> > do we have an issue to directly use Spark >>>>>>>>>> implementation in >>>>>>>>>> > > >>>>>>>> Iceberg? 
>>>>>>>>>> > > >>>>>>>> >>>>>>>>>> > > >>>>>>>> Yes, I think that we do have an issue using the >>>>>>>>>> Spark library. What >>>>>>>>>> > > >>>>>>>> do you think about a Java implementation in Iceberg? >>>>>>>>>> > > >>>>>>>> >>>>>>>>>> > > >>>>>>>> Ryan >>>>>>>>>> > > >>>>>>>> >>>>>>>>>> > > >>>>>>>> On Fri, Jul 12, 2024 at 12:28 PM Ryan Blue < >>>>>>>>>> b...@databricks.com> >>>>>>>>>> > > >>>>>>>> wrote: >>>>>>>>>> > > >>>>>>>> >>>>>>>>>> > > >>>>>>>>> I raised the same point from Peter's email in a >>>>>>>>>> comment on the doc >>>>>>>>>> > > >>>>>>>>> as well. There is a spark-variant_2.13 artifact >>>>>>>>>> that would be a much >>>>>>>>>> > > >>>>>>>>> smaller scope than relying on large portions of >>>>>>>>>> Spark, but even then I >>>>>>>>>> > > >>>>>>>>> doubt that it is a good idea for Iceberg to depend >>>>>>>>>> on that because it is a >>>>>>>>>> > > >>>>>>>>> Scala artifact and we would need to bring in a ton >>>>>>>>>> of Scala libs. I think >>>>>>>>>> > > >>>>>>>>> what makes the most sense is to have an independent >>>>>>>>>> implementation of the >>>>>>>>>> > > >>>>>>>>> spec in Iceberg. >>>>>>>>>> > > >>>>>>>>> >>>>>>>>>> > > >>>>>>>>> On Fri, Jul 12, 2024 at 11:51 AM Péter Váry < >>>>>>>>>> > > >>>>>>>>> peter.vary.apa...@gmail.com> wrote: >>>>>>>>>> > > >>>>>>>>> >>>>>>>>>> > > >>>>>>>>>> Hi Aihua, >>>>>>>>>> > > >>>>>>>>>> Long time no see :) >>>>>>>>>> > > >>>>>>>>>> Would this mean that every engine which plans to >>>>>>>>>> support the Variant >>>>>>>>>> > > >>>>>>>>>> data type needs to add Spark as a dependency? Like >>>>>>>>>> Flink/Trino/Hive etc.? >>>>>>>>>> > > >>>>>>>>>> Thanks, Peter >>>>>>>>>> > > >>>>>>>>>> >>>>>>>>>> > > >>>>>>>>>> >>>>>>>>>> > > >>>>>>>>>> On Fri, Jul 12, 2024, 19:10 Aihua Xu < >>>>>>>>>> aihu...@apache.org> wrote: >>>>>>>>>> > > >>>>>>>>>> >>>>>>>>>> > > >>>>>>>>>>> Thanks Ryan. >>>>>>>>>> > > >>>>>>>>>>> >>>>>>>>>> > > >>>>>>>>>>> Yeah. 
That's another reason we want to pursue >>>>>>>>>> Spark encoding to >>>>>>>>>> > > >>>>>>>>>>> keep compatibility for the open source engines. >>>>>>>>>> > > >>>>>>>>>>> >>>>>>>>>> > > >>>>>>>>>>> One more question regarding the encoding >>>>>>>>>> implementation: do we >>>>>>>>>> > > >>>>>>>>>>> have an issue to directly use Spark >>>>>>>>>> implementation in Iceberg? Russell >>>>>>>>>> > > >>>>>>>>>>> pointed out that Trino doesn't have Spark >>>>>>>>>> dependency and that could be a >>>>>>>>>> > > >>>>>>>>>>> problem? >>>>>>>>>> > > >>>>>>>>>>> >>>>>>>>>> > > >>>>>>>>>>> Thanks, >>>>>>>>>> > > >>>>>>>>>>> Aihua >>>>>>>>>> > > >>>>>>>>>>> >>>>>>>>>> > > >>>>>>>>>>> On 2024/07/12 15:02:06 Ryan Blue wrote: >>>>>>>>>> > > >>>>>>>>>>> > Thanks, Aihua! >>>>>>>>>> > > >>>>>>>>>>> > >>>>>>>>>> > > >>>>>>>>>>> > I think that the encoding choice in the current >>>>>>>>>> doc is a good >>>>>>>>>> > > >>>>>>>>>>> one. I went >>>>>>>>>> > > >>>>>>>>>>> > through the Spark encoding in detail and it >>>>>>>>>> looks like a >>>>>>>>>> > > >>>>>>>>>>> better choice than >>>>>>>>>> > > >>>>>>>>>>> > the other candidate encodings for quickly >>>>>>>>>> accessing nested >>>>>>>>>> > > >>>>>>>>>>> fields. >>>>>>>>>> > > >>>>>>>>>>> > >>>>>>>>>> > > >>>>>>>>>>> > Another reason to use the Spark type is that >>>>>>>>>> this is what >>>>>>>>>> > > >>>>>>>>>>> Delta's variant >>>>>>>>>> > > >>>>>>>>>>> > type is based on, so Parquet files in tables >>>>>>>>>> written by Delta >>>>>>>>>> > > >>>>>>>>>>> could be >>>>>>>>>> > > >>>>>>>>>>> > converted or used in Iceberg tables without >>>>>>>>>> needing to rewrite >>>>>>>>>> > > >>>>>>>>>>> variant >>>>>>>>>> > > >>>>>>>>>>> > data. (Also, note that I work at Databricks and >>>>>>>>>> have an >>>>>>>>>> > > >>>>>>>>>>> interest in >>>>>>>>>> > > >>>>>>>>>>> > increasing format compatibility.) 
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>>>>> > > >>>>>>>>>>> > Ryan >>>>>>>>>> > > >>>>>>>>>>> > >>>>>>>>>> > > >>>>>>>>>>> > On Thu, Jul 11, 2024 at 11:21 AM Aihua Xu < >>>>>>>>>> > > >>>>>>>>>>> aihua...@snowflake.com.invalid> >>>>>>>>>> > > >>>>>>>>>>> > wrote: >>>>>>>>>> > > >>>>>>>>>>> > >>>>>>>>>> > > >>>>>>>>>>> > > [Discuss] Consensus for Variant Encoding >>>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>>> > > >>>>>>>>>>> > > It’s great to be able to present the Variant >>>>>>>>>> type proposal >>>>>>>>>> > > >>>>>>>>>>> in the >>>>>>>>>> > > >>>>>>>>>>> > > community sync yesterday and I’m looking to >>>>>>>>>> host a meeting >>>>>>>>>> > > >>>>>>>>>>> next week >>>>>>>>>> > > >>>>>>>>>>> > > (targeting for 9am, July 17th) to go over any >>>>>>>>>> further >>>>>>>>>> > > >>>>>>>>>>> concerns about the >>>>>>>>>> > > >>>>>>>>>>> > > encoding of the Variant type and any other >>>>>>>>>> questions on the >>>>>>>>>> > > >>>>>>>>>>> first phase of >>>>>>>>>> > > >>>>>>>>>>> > > the proposal >>>>>>>>>> > > >>>>>>>>>>> > > < >>>>>>>>>> > > >>>>>>>>>>> >>>>>>>>>> https://docs.google.com/document/d/1QjhpG_SVNPZh3anFcpicMQx90ebwjL7rmzFYfUP89Iw/edit >>>>>>>>>> > > >>>>>>>>>>> >. >>>>>>>>>> > > >>>>>>>>>>> > > We are hoping that anyone who is interested >>>>>>>>>> in the proposal >>>>>>>>>> > > >>>>>>>>>>> can either join >>>>>>>>>> > > >>>>>>>>>>> > > or reply with their comments so we can >>>>>>>>>> discuss them. Summary >>>>>>>>>> > > >>>>>>>>>>> of the >>>>>>>>>> > > >>>>>>>>>>> > > discussion and notes will be sent to the >>>>>>>>>> mailing list for >>>>>>>>>> > > >>>>>>>>>>> further comment >>>>>>>>>> > > >>>>>>>>>>> > > there. 
>>>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>>> > > >>>>>>>>>>> > > - >>>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>>> > > >>>>>>>>>>> > > What should be the underlying binary >>>>>>>>>> representation? >>>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>>> > > >>>>>>>>>>> > > We have evaluated a few encodings in the doc, >>>>>>>>>> including ION, >>>>>>>>>> > > >>>>>>>>>>> JSONB, and >>>>>>>>>> > > >>>>>>>>>>> > > the Spark encoding. Choosing the underlying >>>>>>>>>> encoding is an >>>>>>>>>> > > >>>>>>>>>>> important first step >>>>>>>>>> > > >>>>>>>>>>> > > here, and we believe we have general support >>>>>>>>>> for Spark’s >>>>>>>>>> > > >>>>>>>>>>> Variant encoding. >>>>>>>>>> > > >>>>>>>>>>> > > We would like to hear if anyone else has >>>>>>>>>> strong opinions in >>>>>>>>>> > > >>>>>>>>>>> this space. >>>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>>> > > >>>>>>>>>>> > > - >>>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>>> > > >>>>>>>>>>> > > Should we support multiple logical types >>>>>>>>>> or just Variant? >>>>>>>>>> > > >>>>>>>>>>> Variant vs. >>>>>>>>>> > > >>>>>>>>>>> > > Variant + JSON. >>>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>>> > > >>>>>>>>>>> > > This is to discuss which logical data type(s) >>>>>>>>>> should be supported >>>>>>>>>> > > >>>>>>>>>>> in Iceberg - >>>>>>>>>> > > >>>>>>>>>>> > > Variant only vs. Variant + JSON. Both types >>>>>>>>>> would share the >>>>>>>>>> > > >>>>>>>>>>> same underlying >>>>>>>>>> > > >>>>>>>>>>> > > encoding but would imply different >>>>>>>>>> limitations on engines >>>>>>>>>> > > >>>>>>>>>>> working with >>>>>>>>>> > > >>>>>>>>>>> > > those types. >>>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>>> > > >>>>>>>>>>> > > From the sync-up meeting, we are leaning >>>>>>>>>> toward >>>>>>>>>> > > >>>>>>>>>>> supporting Variant >>>>>>>>>> > > >>>>>>>>>>> > > only, and we want to have a consensus on the >>>>>>>>>> supported >>>>>>>>>> > > >>>>>>>>>>> type(s). 
>>>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>>> > > >>>>>>>>>>> > > - >>>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>>> > > >>>>>>>>>>> > > How should we move forward with >>>>>>>>>> Subcolumnization? >>>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>>> > > >>>>>>>>>>> > > Subcolumnization is an optimization for >>>>>>>>>> Variant type by >>>>>>>>>> > > >>>>>>>>>>> separating out >>>>>>>>>> > > >>>>>>>>>>> > > subcolumns with their own metadata. This is >>>>>>>>>> not critical for >>>>>>>>>> > > >>>>>>>>>>> choosing the >>>>>>>>>> > > >>>>>>>>>>> > > initial encoding of the Variant type so we >>>>>>>>>> were hoping to >>>>>>>>>> > > >>>>>>>>>>> gain consensus on >>>>>>>>>> > > >>>>>>>>>>> > > leaving that for a follow up spec. >>>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>>> > > >>>>>>>>>>> > > Thanks >>>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>>> > > >>>>>>>>>>> > > Aihua >>>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>>> > > >>>>>>>>>>> > > Meeting invite: >>>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>>> > > >>>>>>>>>>> > > Wednesday, July 17 · 9:00 – 10:00am >>>>>>>>>> > > >>>>>>>>>>> > > Time zone: America/Los_Angeles >>>>>>>>>> > > >>>>>>>>>>> > > Google Meet joining info >>>>>>>>>> > > >>>>>>>>>>> > > Video call link: >>>>>>>>>> https://meet.google.com/pbm-ovzn-aoq >>>>>>>>>> > > >>>>>>>>>>> > > Or dial: (US) +1 650-449-9343 PIN: 170 576 >>>>>>>>>> 525# >>>>>>>>>> > > >>>>>>>>>>> > > More phone numbers: >>>>>>>>>> > > >>>>>>>>>>> https://tel.meet/pbm-ovzn-aoq?pin=4079632691790 >>>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>>> > > >>>>>>>>>>> > > On Tue, May 28, 2024 at 9:21 PM Aihua Xu < >>>>>>>>>> > > >>>>>>>>>>> aihua...@snowflake.com> wrote: >>>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>>> > > >>>>>>>>>>> > >> Hello, >>>>>>>>>> > > >>>>>>>>>>> > >> >>>>>>>>>> > > >>>>>>>>>>> > >> We have drafted the proposal >>>>>>>>>> > > >>>>>>>>>>> > >> < >>>>>>>>>> > > >>>>>>>>>>> >>>>>>>>>> https://docs.google.com/document/d/1QjhpG_SVNPZh3anFcpicMQx90ebwjL7rmzFYfUP89Iw/edit 
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>>>>>> > > >>>>>>>>>>> > >> for Variant data type. Please help review >>>>>>>>>> and comment. >>>>>>>>>> > > >>>>>>>>>>> > >> >>>>>>>>>> > > >>>>>>>>>>> > >> Thanks, >>>>>>>>>> > > >>>>>>>>>>> > >> Aihua >>>>>>>>>> > > >>>>>>>>>>> > >> >>>>>>>>>> > > >>>>>>>>>>> > >> On Thu, May 16, 2024 at 12:45 PM Jack Ye < >>>>>>>>>> > > >>>>>>>>>>> yezhao...@gmail.com> wrote: >>>>>>>>>> > > >>>>>>>>>>> > >> >>>>>>>>>> > > >>>>>>>>>>> > >>> +10000 for a JSON/BSON type. We also had >>>>>>>>>> the same >>>>>>>>>> > > >>>>>>>>>>> discussion internally >>>>>>>>>> > > >>>>>>>>>>> > >>> and a JSON type would really play well with >>>>>>>>>> for example >>>>>>>>>> > > >>>>>>>>>>> the SUPER type in >>>>>>>>>> > > >>>>>>>>>>> > >>> Redshift: >>>>>>>>>> > > >>>>>>>>>>> > >>> >>>>>>>>>> > > >>>>>>>>>>> >>>>>>>>>> https://docs.aws.amazon.com/redshift/latest/dg/r_SUPER_type.html, >>>>>>>>>> > > >>>>>>>>>>> and >>>>>>>>>> > > >>>>>>>>>>> > >>> can also provide better integration with >>>>>>>>>> the Trino JSON >>>>>>>>>> > > >>>>>>>>>>> type. >>>>>>>>>> > > >>>>>>>>>>> > >>> >>>>>>>>>> > > >>>>>>>>>>> > >>> Looking forward to the proposal! 
>>>>>>>>>> > > >>>>>>>>>>> > >>> >>>>>>>>>> > > >>>>>>>>>>> > >>> Best, >>>>>>>>>> > > >>>>>>>>>>> > >>> Jack Ye >>>>>>>>>> > > >>>>>>>>>>> > >>> >>>>>>>>>> > > >>>>>>>>>>> > >>> >>>>>>>>>> > > >>>>>>>>>>> > >>> On Wed, May 15, 2024 at 9:37 AM Tyler Akidau >>>>>>>>>> > > >>>>>>>>>>> > >>> <tyler.aki...@snowflake.com.invalid> wrote: >>>>>>>>>> > > >>>>>>>>>>> > >>> >>>>>>>>>> > > >>>>>>>>>>> > >>>> On Tue, May 14, 2024 at 7:58 PM Gang Wu < >>>>>>>>>> ust...@gmail.com> >>>>>>>>>> > > >>>>>>>>>>> wrote: >>>>>>>>>> > > >>>>>>>>>>> > >>>> >>>>>>>>>> > > >>>>>>>>>>> > >>>>> > We may need some guidance on just how >>>>>>>>>> many we need to >>>>>>>>>> > > >>>>>>>>>>> look at; >>>>>>>>>> > > >>>>>>>>>>> > >>>>> > we were planning on Spark and Trino, >>>>>>>>>> but weren't sure >>>>>>>>>> > > >>>>>>>>>>> how much >>>>>>>>>> > > >>>>>>>>>>> > >>>>> > further down the rabbit hole we needed >>>>>>>>>> to go. >>>>>>>>>> > > >>>>>>>>>>> > >>>>> >>>>>>>>>> > > >>>>>>>>>>> > >>>>> There are some engines living outside the >>>>>>>>>> Java world. It >>>>>>>>>> > > >>>>>>>>>>> would be >>>>>>>>>> > > >>>>>>>>>>> > >>>>> good if the proposal could cover the >>>>>>>>>> effort it takes to >>>>>>>>>> > > >>>>>>>>>>> integrate the >>>>>>>>>> > > >>>>>>>>>>> > >>>>> variant type into them (e.g. Velox, >>>>>>>>>> DataFusion, etc.). >>>>>>>>>> > > >>>>>>>>>>> This is something >>>>>>>>>> > > >>>>>>>>>>> > >>>>> that >>>>>>>>>> > > >>>>>>>>>>> > >>>>> some proprietary Iceberg vendors also >>>>>>>>>> care about. >>>>>>>>>> > > >>>>>>>>>>> > >>>>> >>>>>>>>>> > > >>>>>>>>>>> > >>>> >>>>>>>>>> > > >>>>>>>>>>> > >>>> Ack, makes sense. We can make sure to >>>>>>>>>> share some >>>>>>>>>> > > >>>>>>>>>>> perspective on this. >>>>>>>>>> > > >>>>>>>>>>> > >>>> >>>>>>>>>> > > >>>>>>>>>>> > >>>> > Not necessarily, no. 
As long as there's >>>>>>>>>> a binary type >>>>>>>>>> > > >>>>>>>>>>> and Iceberg and >>>>>>>>>> > > >>>>>>>>>>> > >>>>> > the query engines are aware that the >>>>>>>>>> binary column >>>>>>>>>> > > >>>>>>>>>>> needs to be >>>>>>>>>> > > >>>>>>>>>>> > >>>>> > interpreted as a variant, that should >>>>>>>>>> be sufficient. >>>>>>>>>> > > >>>>>>>>>>> > >>>>> >>>>>>>>>> > > >>>>>>>>>>> > >>>>> From the perspective of interoperability, >>>>>>>>>> it would be >>>>>>>>>> > > >>>>>>>>>>> good to support >>>>>>>>>> > > >>>>>>>>>>> > >>>>> native >>>>>>>>>> > > >>>>>>>>>>> > >>>>> type from file specs. Life will be easier >>>>>>>>>> for projects >>>>>>>>>> > > >>>>>>>>>>> like Apache >>>>>>>>>> > > >>>>>>>>>>> > >>>>> XTable. >>>>>>>>>> > > >>>>>>>>>>> > >>>>> File format could also provide >>>>>>>>>> finer-grained statistics >>>>>>>>>> > > >>>>>>>>>>> for variant >>>>>>>>>> > > >>>>>>>>>>> > >>>>> type which >>>>>>>>>> > > >>>>>>>>>>> > >>>>> facilitates data skipping. >>>>>>>>>> > > >>>>>>>>>>> > >>>>> >>>>>>>>>> > > >>>>>>>>>>> > >>>> >>>>>>>>>> > > >>>>>>>>>>> > >>>> Agreed, there can definitely be additional >>>>>>>>>> value in >>>>>>>>>> > > >>>>>>>>>>> native file format >>>>>>>>>> > > >>>>>>>>>>> > >>>> integration. Just wanted to highlight that >>>>>>>>>> it's not a >>>>>>>>>> > > >>>>>>>>>>> strict requirement. >>>>>>>>>> > > >>>>>>>>>>> > >>>> >>>>>>>>>> > > >>>>>>>>>>> > >>>> -Tyler >>>>>>>>>> > > >>>>>>>>>>> > >>>> >>>>>>>>>> > > >>>>>>>>>>> > >>>> >>>>>>>>>> > > >>>>>>>>>>> > >>>>> >>>>>>>>>> > > >>>>>>>>>>> > >>>>> Gang >>>>>>>>>> > > >>>>>>>>>>> > >>>>> >>>>>>>>>> > > >>>>>>>>>>> > >>>>> On Wed, May 15, 2024 at 6:49 AM Tyler >>>>>>>>>> Akidau >>>>>>>>>> > > >>>>>>>>>>> > >>>>> <tyler.aki...@snowflake.com.invalid> >>>>>>>>>> wrote: >>>>>>>>>> > > >>>>>>>>>>> > >>>>> >>>>>>>>>> > > >>>>>>>>>>> > >>>>>> Good to see you again as well, JB! >>>>>>>>>> Thanks! 
>>>>>>>>>> > > >>>>>>>>>>> > >>>>>> >>>>>>>>>> > > >>>>>>>>>>> > >>>>>> -Tyler >>>>>>>>>> > > >>>>>>>>>>> > >>>>>> >>>>>>>>>> > > >>>>>>>>>>> > >>>>>> >>>>>>>>>> > > >>>>>>>>>>> > >>>>>> On Tue, May 14, 2024 at 1:04 PM >>>>>>>>>> Jean-Baptiste Onofré < >>>>>>>>>> > > >>>>>>>>>>> j...@nanthrax.net> >>>>>>>>>> > > >>>>>>>>>>> > >>>>>> wrote: >>>>>>>>>> > > >>>>>>>>>>> > >>>>>> >>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> Hi Tyler, >>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> >>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> Super happy to see you there :) It >>>>>>>>>> reminds me our >>>>>>>>>> > > >>>>>>>>>>> discussions back in >>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> the start of Apache Beam :) >>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> >>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> Anyway, the thread is pretty >>>>>>>>>> interesting. I remember >>>>>>>>>> > > >>>>>>>>>>> some discussions >>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> about JSON datatype for spec v3. The >>>>>>>>>> binary data type >>>>>>>>>> > > >>>>>>>>>>> is already >>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> supported in the spec v2. >>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> >>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> I'm looking forward to the proposal and >>>>>>>>>> happy to help >>>>>>>>>> > > >>>>>>>>>>> on this ! >>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> >>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> Regards >>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> JB >>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> >>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> On Sat, May 11, 2024 at 7:06 AM Tyler >>>>>>>>>> Akidau >>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> <tyler.aki...@snowflake.com.invalid> >>>>>>>>>> wrote: >>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> > >>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> > Hello, >>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> > >>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> > We (Tyler, Nileema, Selcuk, Aihua) >>>>>>>>>> are working on a >>>>>>>>>> > > >>>>>>>>>>> proposal for >>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> which we’d like to get early feedback >>>>>>>>>> from the >>>>>>>>>> > > >>>>>>>>>>> community. 
>>>> As you may know, Snowflake has embraced Iceberg as its open data lake
>>>> format. Having made good progress on our own adoption of the Iceberg
>>>> standard, we’re now in a position where there are features not yet
>>>> supported in Iceberg which we think would be valuable for our users,
>>>> and that we would like to discuss with and help contribute to the
>>>> Iceberg community.
>>>>
>>>> The first two such features we’d like to discuss are in support of
>>>> efficient querying of dynamically typed, semi-structured data: variant
>>>> data types, and subcolumnarization of variant columns. In more detail,
>>>> for anyone who may not already be familiar:
>>>>
>>>> 1. Variant data types
>>>> Variant types allow for the efficient binary encoding of dynamic
>>>> semi-structured data such as JSON, Avro, etc. By encoding
>>>> semi-structured data as a variant column, we retain the flexibility of
>>>> the source data while allowing query engines to operate on the data
>>>> more efficiently.
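A minimal sketch of the variant-encoding idea described above (hypothetical Python, NOT the actual Spark variant binary format): field names are dictionary-encoded once into a separate metadata buffer, and the value buffer references them by numeric id, so repeated keys stay cheap while values keep their structure.

```python
import json
import struct

def encode_variant(obj):
    """Encode a JSON-like dict into (metadata, value) buffers.

    Illustrative only: the real Spark variant spec packs the key
    dictionary and typed values far more compactly.
    """
    keys = sorted(obj)                    # field-name dictionary
    metadata = json.dumps(keys).encode()  # key dictionary buffer
    value = bytearray()
    for i, k in enumerate(keys):
        payload = json.dumps(obj[k]).encode()
        value += struct.pack("<H", i)                 # field id, not name
        value += struct.pack("<I", len(payload))      # payload length
        value += payload
    return bytes(metadata), bytes(value)

def decode_variant(metadata, value):
    """Rebuild the dict by resolving field ids against the metadata."""
    keys = json.loads(metadata)
    out, off = {}, 0
    while off < len(value):
        (fid,) = struct.unpack_from("<H", value, off); off += 2
        (n,) = struct.unpack_from("<I", value, off); off += 4
        out[keys[fid]] = json.loads(value[off:off + n]); off += n
    return out

row = {"event": "click", "ts": 1715700000, "props": {"x": 3}}
md, val = encode_variant(row)
assert decode_variant(md, val) == row
```

The point of the separation is that an engine can cache the (small, often repeated) metadata buffer and navigate values by id without re-parsing field names per row.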
>>>> Snowflake has supported the variant data type on Snowflake tables for
>>>> many years [1]. As more and more users utilize Iceberg tables in
>>>> Snowflake, we’re hearing an increasing chorus of requests for variant
>>>> support. Additionally, other query engines such as Apache Spark have
>>>> begun adding variant support [2]. As such, we believe it would be
>>>> beneficial to the Iceberg community as a whole to standardize the
>>>> variant data type encoding used across Iceberg tables.
>>>>
>>>> One specific point to make here is that, since an Apache OSS version
>>>> of variant encoding already exists in Spark, it likely makes sense to
>>>> simply adopt the Spark encoding as the Iceberg standard as well. The
>>>> encoding we use internally today in Snowflake is slightly different,
>>>> but essentially equivalent, and we see no particular value in
>>>> cluttering the space with another equivalent-but-incompatible
>>>> encoding.
>>>>
>>>> 2. Subcolumnarization
>>>> Subcolumnarization of variant columns allows query engines to
>>>> efficiently prune datasets when subcolumns (i.e., nested fields)
>>>> within a variant column are queried, and also allows optionally
>>>> materializing some of the nested fields as columns of their own,
>>>> affording queries on these subcolumns the ability to read less data
>>>> and spend less CPU on extraction. When subcolumnarizing, the system
>>>> managing table metadata and data tracks individual pruning statistics
>>>> (min, max, null count, etc.) for some subset of the nested fields
>>>> within a variant, and also manages any optional materialization.
>>>> Without subcolumnarization, any query which touches a variant column
>>>> must read, parse, extract, and filter every row for which that column
>>>> is non-null. Thus, by providing a standardized way of tracking
>>>> subcolumn metadata and data for variant columns, Iceberg can make
>>>> subcolumnar optimizations accessible across various catalogs and
>>>> query engines.
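To make the pruning mechanics concrete, here is a hypothetical sketch (illustrative names only, not Iceberg spec metadata) of how per-file min/max statistics tracked for a variant subcolumn let a planner skip whole files without parsing any variant values.

```python
from dataclasses import dataclass

@dataclass
class SubcolumnStats:
    """Pruning stats for one nested field path inside a variant column."""
    path: str          # e.g. "payload.user.age" (dotted path, illustrative)
    min_val: object
    max_val: object
    null_count: int

@dataclass
class DataFile:
    name: str
    subcolumn_stats: dict  # path -> SubcolumnStats

def may_contain(f: DataFile, path: str, lo, hi) -> bool:
    """Could file f hold rows where variant field `path` falls in [lo, hi]?

    If the path is not subcolumnarized we must conservatively read the
    file; otherwise the min/max range decides.
    """
    stats = f.subcolumn_stats.get(path)
    if stats is None:
        return True  # untracked subcolumn: cannot prune
    return not (stats.max_val < lo or stats.min_val > hi)

files = [
    DataFile("a.parquet", {"payload.user.age":
                           SubcolumnStats("payload.user.age", 18, 25, 0)}),
    DataFile("b.parquet", {"payload.user.age":
                           SubcolumnStats("payload.user.age", 40, 65, 2)}),
]

# Filter: payload.user.age BETWEEN 20 AND 30 -> only a.parquet survives.
survivors = [f.name for f in files
             if may_contain(f, "payload.user.age", 20, 30)]
assert survivors == ["a.parquet"]
```

Without such stats, the `BETWEEN` predicate above would force reading and parsing every non-null variant value in both files.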
>>>> Subcolumnarization is a non-trivial topic, so we expect any concrete
>>>> proposal to include not only the set of changes to Iceberg metadata
>>>> that allow compatible query engines to interoperate on
>>>> subcolumnarization data for variant columns, but also reference
>>>> documentation explaining subcolumnarization principles and recommended
>>>> best practices.
>>>>
>>>> It sounds like the recent Geo proposal [3] may be a good starting
>>>> point for how to approach this, so our plan is to write something up
>>>> in that vein that covers the proposed spec changes, backwards
>>>> compatibility, implementor burdens, etc. But we wanted to first reach
>>>> out to the community to introduce ourselves and the idea, and see if
>>>> there’s any early feedback we should incorporate before we spend too
>>>> much time on a concrete proposal.
>>>>
>>>> Thank you!
>>>> [1] https://docs.snowflake.com/en/sql-reference/data-types-semistructured
>>>> [2] https://github.com/apache/spark/blob/master/common/variant/README.md
>>>> [3] https://docs.google.com/document/d/1iVFbrRNEzZl8tDcZC81GFt01QJkLJsI9E2NBOt21IRI/edit
>>>>
>>>> -Tyler, Nileema, Selcuk, Aihua

--
Ryan Blue
Databricks