Any chance this meeting was recorded? I couldn't make it but would be interested in catching up on the discussion.
Thanks, Amogh Jahagirdar On Tue, Jul 23, 2024 at 11:30 AM Aihua Xu <aihu...@gmail.com> wrote: > Thanks folks for additional discussion. > > There are some questions related to subcolumnarization (Spark shredding - > see the discussion <https://github.com/apache/spark/pull/46831>) and we > would like to host another meeting to mainly discuss that since we plan to > adopt it. We can also follow up on the Spark variant topics (I can see that > mostly we are aligned, with the exception of finding a place for the spec and > implementation). Look forward to meeting with you. BTW: should I include > dev@iceberg.apache.org in the email invite? > > Sync up on Variant subcolumnarization (shredding) > Thursday, July 25 · 8:00 – 9:00am > Time zone: America/Los_Angeles > Google Meet joining info > Video call link: https://meet.google.com/mug-dvnv-hnq > Or dial: (US) +1 904-900-0730 PIN: 671 997 419# > More phone numbers: https://tel.meet/mug-dvnv-hnq?pin=1654043233422 > > Thanks, > Aihua > > On Tue, Jul 23, 2024 at 6:36 AM Amogh Jahagirdar <2am...@gmail.com> wrote: > >> I'm late replying to this but I'm also in agreement with 1 (adopting the >> Spark variant encoding), 3 (specifically only having a variant type), and 4 >> (ensuring we are thinking through subcolumnarization upfront, since without >> it the variant type may not be that useful). >> >> I'd also support having the spec and reference implementation in >> Iceberg; as others have said, it centralizes improvements in a single, >> agnostic dependency for engines, rather than engines having to take >> dependencies on other engine modules. >> >> Thanks, >> >> Amogh Jahagirdar >> >> On Tue, Jul 23, 2024 at 12:15 AM Péter Váry <peter.vary.apa...@gmail.com> >> wrote: >> >>> I have been looking at how we can map the Variant type in Flink. I have >>> not found any existing type which we could use, but Flink already has some >>> JSON parsing capabilities [1] for string fields.
>>> >>> So until we have native support in Flink for something similar to >>> the Variant type, I expect that we need to map it to JSON strings in RowData. >>> >>> Based on that, here are my preferences: >>> 1. I'm ok with adopting the Spark Variant type, if we build our own Iceberg >>> serializer/deserializer module for it >>> 2. I prefer to move the spec to Iceberg, so we own it and can extend it if >>> needed. This could be important in the first phase. Later, when it is more >>> stable, we might donate it to some other project, like Parquet >>> 3. I would prefer to support only a single type, and Variant is more >>> expressive, but having a standard way to convert between JSON and Variant >>> would be useful for Flink users. >>> 4. On subcolumnarization: I think Flink will only use this feature as >>> much as the Iceberg readers implement it, so I would like to see as much >>> as possible of it in the common Iceberg code >>> >>> Thanks, >>> Peter >>> >>> [1] - >>> https://nightlies.apache.org/flink/flink-docs-release-1.19/docs/dev/table/functions/systemfunctions/#json-functions >>> >>> >>> On Tue, Jul 23, 2024, 06:36 Micah Kornfield <emkornfi...@gmail.com> >>> wrote: >>> >>>> Sorry for the late reply. I agree with the sentiments on 1 and 3 that >>>> have already been posted (adopt the Spark encoding, and only have the >>>> Variant type). As mentioned on the doc for 3, I think it would be good to >>>> specify how to map scalar types to a JSON representation so there can be >>>> consistency between engines that don't support variant. >>>> >>>> >>>>> Regarding point 2, I also feel Iceberg is more natural to host such a >>>>> subproject for variant spec and implementation. But let me reach out to >>>>> the >>>>> Spark community to discuss. >>>> >>>> >>>> The only other place I can think of that might be a good home for >>>> the Variant spec could be Apache Arrow, as a canonical extension type. There >>>> is an issue for this [1].
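To make the preceding discussion concrete: the encoding under adoption stores a variant in two parts, a metadata section whose key dictionary is shared, and a value section that references keys by id; engines without a native type (Peter's Flink case above) can fall back to rendering the value as a JSON string. The toy below is an illustrative sketch only — the real Spark encoding is a compact binary format, and every name here (`ToyVariant`, `toJson`, `dictionarySize`) is invented for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of the two-part variant layout: a key dictionary (the
// "metadata" half) plus values that reference keys by integer id (the
// "value" half). All names here are invented for illustration; the real
// Spark encoding is a compact binary format, not Java objects.
public class ToyVariant {
    private final List<String> dictionary = new ArrayList<>();         // metadata: each key stored once
    private final Map<Integer, Object> fields = new LinkedHashMap<>(); // value: key id -> scalar

    public void put(String key, Object scalar) {
        int id = dictionary.indexOf(key);
        if (id < 0) {
            dictionary.add(key);
            id = dictionary.size() - 1;
        }
        fields.put(id, scalar); // repeated keys reuse the same dictionary id
    }

    public int dictionarySize() {
        return dictionary.size();
    }

    // Fallback for engines with no native variant type (e.g. mapping to a
    // JSON string in Flink RowData): render the value as JSON text.
    // (No string escaping here; a real implementation must escape.)
    public String toJson() {
        StringBuilder sb = new StringBuilder("{");
        boolean first = true;
        for (Map.Entry<Integer, Object> e : fields.entrySet()) {
            if (!first) sb.append(',');
            first = false;
            sb.append('"').append(dictionary.get(e.getKey())).append("\":");
            Object v = e.getValue();
            sb.append(v instanceof String ? "\"" + v + "\"" : String.valueOf(v));
        }
        return sb.append('}').toString();
    }
}
```

A real implementation would serialize both parts to byte buffers; the point is only that repeated keys cost one dictionary entry plus small integer references, which is what makes the encoding compact and fast to navigate.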
I think the main thing on where this is housed >>>> is which types are intended to be supported. I believe Arrow is currently >>>> a superset of the Iceberg type system (UUID is supported as a canonical >>>> extension type [2]). >>>> >>>> For point 4 subcolumnarization, I think ideally this belongs in Iceberg >>>> (and if Iceberg and Delta Lake can agree on how to do it that would be >>>> great) with potential consultation with Parquet/ORC communities to >>>> potentially add better native support. >>>> >>>> Thanks, >>>> Micah >>>> >>>> >>>> >>>> [1] https://github.com/apache/arrow/issues/42069 >>>> [2] https://arrow.apache.org/docs/format/CanonicalExtensions.html >>>> >>>> On Sat, Jul 20, 2024 at 5:54 PM Aihua Xu <aihu...@gmail.com> wrote: >>>> >>>>> Thanks for the discussion and feedback. >>>>> >>>>> Do we have the consensus on point 1 and point 3 to move forward with >>>>> Spark variant encoding and support Variant type only? Or let me know how >>>>> to >>>>> proceed from here. >>>>> >>>>> Regarding point 2, I also feel Iceberg is more natural to host such a >>>>> subproject for variant spec and implementation. But let me reach out to >>>>> the >>>>> Spark community to discuss. >>>>> >>>>> Thanks, >>>>> Aihua >>>>> >>>>> >>>>> On Fri, Jul 19, 2024 at 9:35 AM Yufei Gu <flyrain...@gmail.com> wrote: >>>>> >>>>>> Agreed with point 1. >>>>>> >>>>>> For point 2, I also prefer to hold the spec and reference >>>>>> implementation under Iceberg. Here are the reasons: >>>>>> 1. It is unconventional and impractical for one engine to depend on >>>>>> another for data types. For instance, it is not ideal for Trino to rely >>>>>> on >>>>>> data types defined by the Spark engine. >>>>>> 2. Iceberg serves as a bridge between engines and file formats. By >>>>>> centralizing the specification in Iceberg, any future optimizations or >>>>>> updates to file formats can be referred to within Iceberg, ensuring >>>>>> consistency and reducing dependencies. 
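As a side note on Micah's earlier suggestion to specify how scalar types map to JSON for engines without variant support, a deterministic mapping might look like the sketch below. The concrete rules (ISO-8601 strings for timestamps, base64 for binary) and the class name are assumptions made for this example, not part of any agreed spec.

```java
import java.time.Instant;
import java.util.Base64;

// Hypothetical sketch of a deterministic scalar-to-JSON mapping, so that
// engines without variant support render identical text. The concrete rules
// (ISO-8601 strings for timestamps, base64 for binary) and the class name
// are assumptions made for this example, not part of any agreed spec.
public final class ScalarToJson {
    public static String toJson(Object scalar) {
        if (scalar == null) return "null";
        if (scalar instanceof Boolean || scalar instanceof Integer
                || scalar instanceof Long || scalar instanceof Double) {
            return String.valueOf(scalar); // note: a real spec must pin float formatting
        }
        if (scalar instanceof byte[] bytes) {
            return '"' + Base64.getEncoder().encodeToString(bytes) + '"';
        }
        if (scalar instanceof Instant ts) {
            return '"' + ts.toString() + '"'; // ISO-8601, e.g. "2024-07-17T16:00:00Z"
        }
        if (scalar instanceof String s) {
            return '"' + s.replace("\\", "\\\\").replace("\"", "\\\"") + '"';
        }
        throw new IllegalArgumentException("unsupported scalar: " + scalar.getClass());
    }
}
```

The value of pinning rules like these is that two engines reading the same variant value as a string produce byte-identical output, which matters for query results and for tests.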
>>>>>> >>>>>> For point 3, I'd prefer to support the variant type only at this >>>>>> moment. >>>>>> >>>>>> Yufei >>>>>> >>>>>> >>>>>> On Thu, Jul 18, 2024 at 12:55 PM Ryan Blue >>>>>> <b...@databricks.com.invalid> wrote: >>>>>> >>>>>>> Similarly, I'm aligned with point 1 and I'd choose to support only >>>>>>> variant for point 3. >>>>>>> >>>>>>> We'll need to work with the Spark community to find a good place for >>>>>>> the library and spec, since it touches many different projects. I'd also >>>>>>> prefer Iceberg as the home. >>>>>>> >>>>>>> I also think it's a good idea to get subcolumnarization into our >>>>>>> spec when we update. Without that I think the feature will be fairly >>>>>>> limited. >>>>>>> >>>>>>> On Thu, Jul 18, 2024 at 10:56 AM Russell Spitzer < >>>>>>> russell.spit...@gmail.com> wrote: >>>>>>> >>>>>>>> I'm aligned with point 1. >>>>>>>> >>>>>>>> For point 2, I think we should choose quickly. I honestly do think >>>>>>>> this would be fine as part of the Iceberg Spec directly, but understand >>>>>>>> it >>>>>>>> may be better for the broader community if it were a sub-project. As >>>>>>>> a >>>>>>>> sub-project, I would still prefer it being an Iceberg subproject, since >>>>>>>> we >>>>>>>> are engine/file-format agnostic. >>>>>>>> >>>>>>>> 3. I support adding just Variant. >>>>>>>> >>>>>>>> On Thu, Jul 18, 2024 at 12:54 AM Aihua Xu <aihu...@apache.org> >>>>>>>> wrote: >>>>>>>> >>>>>>>>> Hello community, >>>>>>>>> >>>>>>>>> It’s great to sync up with some of you on Variant and >>>>>>>>> Subcolumnarization support in Iceberg again. Apologies that I didn’t >>>>>>>>> record >>>>>>>>> the meeting, but here are some key items that we want to follow up >>>>>>>>> with the >>>>>>>>> community. >>>>>>>>> >>>>>>>>> 1. Adopt Spark Variant encoding >>>>>>>>> Those present were in favor of adopting the Spark variant >>>>>>>>> encoding for Iceberg Variant with extensions to support other Iceberg >>>>>>>>> types.
We would like to know if anyone has an objection to this to >>>>>>>>> reuse an >>>>>>>>> open source encoding. >>>>>>>>> >>>>>>>>> 2. Movement of the Spark Variant Spec to another project >>>>>>>>> To avoid introducing Apache Spark as a dependency for the engines >>>>>>>>> and file formats, we discussed separating Spark Variant encoding spec >>>>>>>>> and >>>>>>>>> implementation from the Spark Project to a neutral location. We >>>>>>>>> thought up >>>>>>>>> several solutions but didn’t have consensus on any of them. We are >>>>>>>>> looking >>>>>>>>> for more feedback on this topic from the community either in terms of >>>>>>>>> support for one of these options or another idea on how to support >>>>>>>>> the spec. >>>>>>>>> >>>>>>>>> Options Proposed: >>>>>>>>> * Leave the Spec in Spark (Difficult for versioning and other >>>>>>>>> engines) >>>>>>>>> * Copying the Spec into Iceberg Project Directly (Difficult for >>>>>>>>> other Table Formats) >>>>>>>>> * Creating a Sub-Project of Apache Iceberg and moving the spec and >>>>>>>>> reference implementation there (Logistically complicated) >>>>>>>>> * Creating a Sub-Project of Apache Spark and moving the spec and >>>>>>>>> reference implementation there (Logistically complicated) >>>>>>>>> >>>>>>>>> 3. Add Variant type vs. Variant and JSON types >>>>>>>>> Those who were present were in favor of adding only the Variant >>>>>>>>> type to Iceberg. We are looking for anyone who has an objection to >>>>>>>>> going >>>>>>>>> forward with just the Variant Type and no Iceberg JSON Type. We were >>>>>>>>> favoring adding Variant type only because: >>>>>>>>> * Introducing a JSON type would require engines that only support >>>>>>>>> VARIANT to do write time validation of their input to a JSON column. >>>>>>>>> If >>>>>>>>> they don’t have a JSON type an engine wouldn’t support this. 
>>>>>>>>> * Engines which don’t support Variant will work most of the time >>>>>>>>> but can have fallback strings defined in the spec for reading >>>>>>>>> unsupported >>>>>>>>> types. Writing a JSON into a Variant will always work. >>>>>>>>> >>>>>>>>> 4. Support for Subcolumnarization spec (shredding in Spark) >>>>>>>>> We have no action items on this but would like to follow up on >>>>>>>>> discussions on Subcolumnarization in the future. >>>>>>>>> * We had general agreement that this should be included in Iceberg >>>>>>>>> V3, or else adding variant may not be useful. >>>>>>>>> * We are interested in also adopting the shredding spec from Spark >>>>>>>>> and would like to move it to whatever place we decide the Variant >>>>>>>>> spec is >>>>>>>>> going to live. >>>>>>>>> >>>>>>>>> Let us know if we missed anything and if you have any additional >>>>>>>>> thoughts or suggestions. >>>>>>>>> >>>>>>>>> Thanks >>>>>>>>> Aihua >>>>>>>>> >>>>>>>>> >>>>>>>>> On 2024/07/15 18:32:22 Aihua Xu wrote: >>>>>>>>> > Thanks for the discussion. >>>>>>>>> > >>>>>>>>> > I will move forward to work on the spec PR. >>>>>>>>> > >>>>>>>>> > Regarding the implementation, we will have a module for Variant >>>>>>>>> support in Iceberg so we will not have to bring in Spark libraries. >>>>>>>>> > >>>>>>>>> > I'm reposting the meeting invite in case it's not clear in my >>>>>>>>> original email since I included it at the end. Looks like we don't have >>>>>>>>> major >>>>>>>>> objections/divergences but let's sync up and have consensus.
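For readers following point 4 above: "shredding" (subcolumnarization) pulls selected nested fields of a variant out into typed columns that carry their own pruning statistics, leaving the remaining fields behind as an opaque residual. The sketch below is a hypothetical illustration of that split only — the names and stats layout are invented here, and the actual spec is the Spark shredding proposal linked earlier in the thread.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative sketch of "shredding" (subcolumnarization): chosen nested
// fields of a variant are pulled out into typed columns with their own
// min/max statistics for pruning, and whatever is left stays behind as an
// opaque residual. Names and the stats layout are invented for this example.
public class ShreddingSketch {
    public record Shredded(Map<String, List<Long>> typedColumns,
                           Map<String, long[]> minMax, // per column: {min, max}
                           List<Map<String, Object>> residual) {}

    public static Shredded shred(List<Map<String, Object>> rows, Set<String> shredPaths) {
        Map<String, List<Long>> columns = new HashMap<>();
        Map<String, long[]> stats = new HashMap<>();
        List<Map<String, Object>> residual = new ArrayList<>();
        for (Map<String, Object> row : rows) {
            Map<String, Object> rest = new HashMap<>(row);
            for (String path : shredPaths) {
                // Only long-typed fields are shredded in this toy; a real
                // implementation also records per-row nulls and type mismatches.
                if (rest.get(path) instanceof Long x) {
                    rest.remove(path);
                    columns.computeIfAbsent(path, p -> new ArrayList<>()).add(x);
                    long[] mm = stats.computeIfAbsent(path, p -> new long[] {x, x});
                    mm[0] = Math.min(mm[0], x);
                    mm[1] = Math.max(mm[1], x);
                }
            }
            residual.add(rest);
        }
        return new Shredded(columns, stats, residual);
    }
}
```

With the min/max stats, a reader can skip files or row groups for a predicate like `price > 100` without decoding any variant bytes, which is the pruning benefit the thread alludes to.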
>>>>>>>>> > >>>>>>>>> > Meeting invite: >>>>>>>>> > >>>>>>>>> > Wednesday, July 17 · 9:00 – 10:00am >>>>>>>>> > Time zone: America/Los_Angeles >>>>>>>>> > Google Meet joining info >>>>>>>>> > Video call link: https://meet.google.com/pbm-ovzn-aoq >>>>>>>>> > Or dial: (US) +1 650-449-9343 PIN: 170 576 525# >>>>>>>>> > More phone numbers: >>>>>>>>> https://tel.meet/pbm-ovzn-aoq?pin=4079632691790 >>>>>>>>> > >>>>>>>>> > Thanks, >>>>>>>>> > Aihua >>>>>>>>> > >>>>>>>>> > On 2024/07/12 20:55:01 Micah Kornfield wrote: >>>>>>>>> > > I don't think this needs to hold up the PR but I think coming >>>>>>>>> to a >>>>>>>>> > > consensus on the exact set of types supported is worthwhile >>>>>>>>> (and if the >>>>>>>>> > > goal is to maintain the same set as specified by the Spark >>>>>>>>> Variant type or >>>>>>>>> > > if divergence is expected/allowed). From a fragmentation >>>>>>>>> perspective it >>>>>>>>> > > would be a shame if they diverge, so maybe a next step is also >>>>>>>>> suggesting >>>>>>>>> > > support to the Spark community on the missing existing Iceberg >>>>>>>>> types? >>>>>>>>> > > >>>>>>>>> > > Thanks, >>>>>>>>> > > Micah >>>>>>>>> > > >>>>>>>>> > > On Fri, Jul 12, 2024 at 1:44 PM Russell Spitzer < >>>>>>>>> russell.spit...@gmail.com> >>>>>>>>> > > wrote: >>>>>>>>> > > >>>>>>>>> > > > Just talked with Aihua and he's working on the Spec PR now. >>>>>>>>> We can get >>>>>>>>> > > > feedback there from everyone. >>>>>>>>> > > > >>>>>>>>> > > > On Fri, Jul 12, 2024 at 3:41 PM Ryan Blue >>>>>>>>> <b...@databricks.com.invalid> >>>>>>>>> > > > wrote: >>>>>>>>> > > > >>>>>>>>> > > >> Good idea, but I'm hoping that we can continue to get their >>>>>>>>> feedback in >>>>>>>>> > > >> parallel to getting the spec changes started. Piotr didn't >>>>>>>>> seem to object >>>>>>>>> > > >> to the encoding from what I read of his comments. Hopefully >>>>>>>>> he (and others) >>>>>>>>> > > >> chime in here. 
>>>>>>>>> > > >> >>>>>>>>> > > >> On Fri, Jul 12, 2024 at 1:32 PM Russell Spitzer < >>>>>>>>> > > >> russell.spit...@gmail.com> wrote: >>>>>>>>> > > >> >>>>>>>>> > > >>> I just want to make sure we get Piotr and Peter on board as >>>>>>>>> > > >>> representatives of Flink and Trino engines. Also make sure >>>>>>>>> we have anyone >>>>>>>>> > > >>> else chime in who has experience with Ray if possible. >>>>>>>>> > > >>> >>>>>>>>> > > >>> Spec changes feel like the right next step. >>>>>>>>> > > >>> >>>>>>>>> > > >>> On Fri, Jul 12, 2024 at 3:14 PM Ryan Blue >>>>>>>>> <b...@databricks.com.invalid> >>>>>>>>> > > >>> wrote: >>>>>>>>> > > >>> >>>>>>>>> > > >>>> Okay, what are the next steps here? This proposal has >>>>>>>>> been out for >>>>>>>>> > > >>>> quite a while and I don't see any major objections to >>>>>>>>> using the Spark >>>>>>>>> > > >>>> encoding. It's quite well designed and fits the need >>>>>>>>> well. It can also be >>>>>>>>> > > >>>> extended to support additional types that are missing if >>>>>>>>> that's a priority. >>>>>>>>> > > >>>> >>>>>>>>> > > >>>> Should we move forward by starting a draft of the changes >>>>>>>>> to the table >>>>>>>>> > > >>>> spec? Then we can vote on committing those changes and >>>>>>>>> get moving on an >>>>>>>>> > > >>>> implementation (or possibly do the implementation in >>>>>>>>> parallel). >>>>>>>>> > > >>>> >>>>>>>>> > > >>>> On Fri, Jul 12, 2024 at 1:08 PM Russell Spitzer < >>>>>>>>> > > >>>> russell.spit...@gmail.com> wrote: >>>>>>>>> > > >>>> >>>>>>>>> > > >>>>> That's fair, I'm sold on an Iceberg Module. >>>>>>>>> > > >>>>> >>>>>>>>> > > >>>>> On Fri, Jul 12, 2024 at 2:53 PM Ryan Blue >>>>>>>>> <b...@databricks.com.invalid> >>>>>>>>> > > >>>>> wrote: >>>>>>>>> > > >>>>> >>>>>>>>> > > >>>>>> > Feels like eventually the encoding should land in >>>>>>>>> parquet proper >>>>>>>>> > > >>>>>> right? >>>>>>>>> > > >>>>>> >>>>>>>>> > > >>>>>> What about using it in ORC? I don't know where it >>>>>>>>> should end up. 
>>>>>>>>> > > >>>>>> Maybe Iceberg should make a standalone module from it? >>>>>>>>> > > >>>>>> >>>>>>>>> > > >>>>>> On Fri, Jul 12, 2024 at 12:38 PM Russell Spitzer < >>>>>>>>> > > >>>>>> russell.spit...@gmail.com> wrote: >>>>>>>>> > > >>>>>> >>>>>>>>> > > >>>>>>> Feels like eventually the encoding should land in >>>>>>>>> parquet proper >>>>>>>>> > > >>>>>>> right? I'm fine with us just copying into Iceberg >>>>>>>>> though for the time >>>>>>>>> > > >>>>>>> being. >>>>>>>>> > > >>>>>>> >>>>>>>>> > > >>>>>>> On Fri, Jul 12, 2024 at 2:31 PM Ryan Blue >>>>>>>>> > > >>>>>>> <b...@databricks.com.invalid> wrote: >>>>>>>>> > > >>>>>>> >>>>>>>>> > > >>>>>>>> Oops, it looks like I missed where Aihua brought this >>>>>>>>> up in his >>>>>>>>> > > >>>>>>>> last email: >>>>>>>>> > > >>>>>>>> >>>>>>>>> > > >>>>>>>> > do we have an issue to directly use Spark >>>>>>>>> implementation in >>>>>>>>> > > >>>>>>>> Iceberg? >>>>>>>>> > > >>>>>>>> >>>>>>>>> > > >>>>>>>> Yes, I think that we do have an issue using the Spark >>>>>>>>> library. What >>>>>>>>> > > >>>>>>>> do you think about a Java implementation in Iceberg? >>>>>>>>> > > >>>>>>>> >>>>>>>>> > > >>>>>>>> Ryan >>>>>>>>> > > >>>>>>>> >>>>>>>>> > > >>>>>>>> On Fri, Jul 12, 2024 at 12:28 PM Ryan Blue < >>>>>>>>> b...@databricks.com> >>>>>>>>> > > >>>>>>>> wrote: >>>>>>>>> > > >>>>>>>> >>>>>>>>> > > >>>>>>>>> I raised the same point from Peter's email in a >>>>>>>>> comment on the doc >>>>>>>>> > > >>>>>>>>> as well. There is a spark-variant_2.13 artifact that >>>>>>>>> would be a much >>>>>>>>> > > >>>>>>>>> smaller scope than relying on large portions of >>>>>>>>> Spark, but I even then I >>>>>>>>> > > >>>>>>>>> doubt that it is a good idea for Iceberg to depend >>>>>>>>> on that because it is a >>>>>>>>> > > >>>>>>>>> Scala artifact and we would need to bring in a ton >>>>>>>>> of Scala libs. 
I think >>>>>>>>> > > >>>>>>>>> what makes the most sense is to have an independent >>>>>>>>> implementation of the >>>>>>>>> > > >>>>>>>>> spec in Iceberg. >>>>>>>>> > > >>>>>>>>> >>>>>>>>> > > >>>>>>>>> On Fri, Jul 12, 2024 at 11:51 AM Péter Váry < >>>>>>>>> > > >>>>>>>>> peter.vary.apa...@gmail.com> wrote: >>>>>>>>> > > >>>>>>>>> >>>>>>>>> > > >>>>>>>>>> Hi Aihua, >>>>>>>>> > > >>>>>>>>>> Long time no see :) >>>>>>>>> > > >>>>>>>>>> Would this mean, that every engine which plans to >>>>>>>>> support Variant >>>>>>>>> > > >>>>>>>>>> data type needs to add Spark as a dependency? Like >>>>>>>>> Flink/Trino/Hive etc? >>>>>>>>> > > >>>>>>>>>> Thanks, Peter >>>>>>>>> > > >>>>>>>>>> >>>>>>>>> > > >>>>>>>>>> >>>>>>>>> > > >>>>>>>>>> On Fri, Jul 12, 2024, 19:10 Aihua Xu < >>>>>>>>> aihu...@apache.org> wrote: >>>>>>>>> > > >>>>>>>>>> >>>>>>>>> > > >>>>>>>>>>> Thanks Ryan. >>>>>>>>> > > >>>>>>>>>>> >>>>>>>>> > > >>>>>>>>>>> Yeah. That's another reason we want to pursue >>>>>>>>> Spark encoding to >>>>>>>>> > > >>>>>>>>>>> keep compatibility for the open source engines. >>>>>>>>> > > >>>>>>>>>>> >>>>>>>>> > > >>>>>>>>>>> One more question regarding the encoding >>>>>>>>> implementation: do we >>>>>>>>> > > >>>>>>>>>>> have an issue to directly use Spark implementation >>>>>>>>> in Iceberg? Russell >>>>>>>>> > > >>>>>>>>>>> pointed out that Trino doesn't have Spark >>>>>>>>> dependency and that could be a >>>>>>>>> > > >>>>>>>>>>> problem? >>>>>>>>> > > >>>>>>>>>>> >>>>>>>>> > > >>>>>>>>>>> Thanks, >>>>>>>>> > > >>>>>>>>>>> Aihua >>>>>>>>> > > >>>>>>>>>>> >>>>>>>>> > > >>>>>>>>>>> On 2024/07/12 15:02:06 Ryan Blue wrote: >>>>>>>>> > > >>>>>>>>>>> > Thanks, Aihua! >>>>>>>>> > > >>>>>>>>>>> > >>>>>>>>> > > >>>>>>>>>>> > I think that the encoding choice in the current >>>>>>>>> doc is a good >>>>>>>>> > > >>>>>>>>>>> one. 
I went >>>>>>>>> > > >>>>>>>>>>> > through the Spark encoding in detail and it >>>>>>>>> looks like a >>>>>>>>> > > >>>>>>>>>>> better choice than >>>>>>>>> > > >>>>>>>>>>> > the other candidate encodings for quickly >>>>>>>>> accessing nested >>>>>>>>> > > >>>>>>>>>>> fields. >>>>>>>>> > > >>>>>>>>>>> > >>>>>>>>> > > >>>>>>>>>>> > Another reason to use the Spark type is that >>>>>>>>> this is what >>>>>>>>> > > >>>>>>>>>>> Delta's variant >>>>>>>>> > > >>>>>>>>>>> > type is based on, so Parquet files in tables >>>>>>>>> written by Delta >>>>>>>>> > > >>>>>>>>>>> could be >>>>>>>>> > > >>>>>>>>>>> > converted or used in Iceberg tables without >>>>>>>>> needing to rewrite >>>>>>>>> > > >>>>>>>>>>> variant >>>>>>>>> > > >>>>>>>>>>> > data. (Also, note that I work at Databricks and >>>>>>>>> have an >>>>>>>>> > > >>>>>>>>>>> interest in >>>>>>>>> > > >>>>>>>>>>> > increasing format compatibility.) >>>>>>>>> > > >>>>>>>>>>> > >>>>>>>>> > > >>>>>>>>>>> > Ryan >>>>>>>>> > > >>>>>>>>>>> > >>>>>>>>> > > >>>>>>>>>>> > On Thu, Jul 11, 2024 at 11:21 AM Aihua Xu < >>>>>>>>> > > >>>>>>>>>>> aihua...@snowflake.com.invalid> >>>>>>>>> > > >>>>>>>>>>> > wrote: >>>>>>>>> > > >>>>>>>>>>> > >>>>>>>>> > > >>>>>>>>>>> > > [Discuss] Consensus for Variant Encoding >>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>> > > >>>>>>>>>>> > > It’s great to be able to present the Variant >>>>>>>>> type proposal >>>>>>>>> > > >>>>>>>>>>> in the >>>>>>>>> > > >>>>>>>>>>> > > community sync yesterday and I’m looking to >>>>>>>>> host a meeting >>>>>>>>> > > >>>>>>>>>>> next week >>>>>>>>> > > >>>>>>>>>>> > > (targeting for 9am, July 17th) to go over any >>>>>>>>> further >>>>>>>>> > > >>>>>>>>>>> concerns about the >>>>>>>>> > > >>>>>>>>>>> > > encoding of the Variant type and any other >>>>>>>>> questions on the >>>>>>>>> > > >>>>>>>>>>> first phase of >>>>>>>>> > > >>>>>>>>>>> > > the proposal >>>>>>>>> > > >>>>>>>>>>> > > < >>>>>>>>> > > >>>>>>>>>>> >>>>>>>>> 
https://docs.google.com/document/d/1QjhpG_SVNPZh3anFcpicMQx90ebwjL7rmzFYfUP89Iw/edit >>>>>>>>> > > >>>>>>>>>>> >. >>>>>>>>> > > >>>>>>>>>>> > > We are hoping that anyone who is interested in >>>>>>>>> the proposal >>>>>>>>> > > >>>>>>>>>>> can either join >>>>>>>>> > > >>>>>>>>>>> > > or reply with their comments so we can discuss >>>>>>>>> them. Summary >>>>>>>>> > > >>>>>>>>>>> of the >>>>>>>>> > > >>>>>>>>>>> > > discussion and notes will be sent to the >>>>>>>>> mailing list for >>>>>>>>> > > >>>>>>>>>>> further comment >>>>>>>>> > > >>>>>>>>>>> > > there. >>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>> > > >>>>>>>>>>> > > - >>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>> > > >>>>>>>>>>> > > What should be the underlying binary >>>>>>>>> representation >>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>> > > >>>>>>>>>>> > > We have evaluated a few encodings in the doc >>>>>>>>> including ION, >>>>>>>>> > > >>>>>>>>>>> JSONB, and >>>>>>>>> > > >>>>>>>>>>> > > Spark encoding.Choosing the underlying >>>>>>>>> encoding is an >>>>>>>>> > > >>>>>>>>>>> important first step >>>>>>>>> > > >>>>>>>>>>> > > here and we believe we have general support >>>>>>>>> for Spark’s >>>>>>>>> > > >>>>>>>>>>> Variant encoding. >>>>>>>>> > > >>>>>>>>>>> > > We would like to hear if anyone else has >>>>>>>>> strong opinions in >>>>>>>>> > > >>>>>>>>>>> this space. >>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>> > > >>>>>>>>>>> > > - >>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>> > > >>>>>>>>>>> > > Should we support multiple logical types or >>>>>>>>> just Variant? >>>>>>>>> > > >>>>>>>>>>> Variant vs. >>>>>>>>> > > >>>>>>>>>>> > > Variant + JSON. >>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>> > > >>>>>>>>>>> > > This is to discuss what logical data type(s) >>>>>>>>> to be supported >>>>>>>>> > > >>>>>>>>>>> in Iceberg - >>>>>>>>> > > >>>>>>>>>>> > > Variant only vs. Variant + JSON. 
Both types >>>>>>>>> would share the >>>>>>>>> > > >>>>>>>>>>> same underlying >>>>>>>>> > > >>>>>>>>>>> > > encoding but would imply different limitations >>>>>>>>> on engines >>>>>>>>> > > >>>>>>>>>>> working with >>>>>>>>> > > >>>>>>>>>>> > > those types. >>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>> > > >>>>>>>>>>> > > From the sync up meeting, we are more favoring >>>>>>>>> toward >>>>>>>>> > > >>>>>>>>>>> supporting Variant >>>>>>>>> > > >>>>>>>>>>> > > only and we want to have a consensus on the >>>>>>>>> supported >>>>>>>>> > > >>>>>>>>>>> type(s). >>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>> > > >>>>>>>>>>> > > - >>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>> > > >>>>>>>>>>> > > How should we move forward with >>>>>>>>> Subcolumnization? >>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>> > > >>>>>>>>>>> > > Subcolumnization is an optimization for >>>>>>>>> Variant type by >>>>>>>>> > > >>>>>>>>>>> separating out >>>>>>>>> > > >>>>>>>>>>> > > subcolumns with their own metadata. This is >>>>>>>>> not critical for >>>>>>>>> > > >>>>>>>>>>> choosing the >>>>>>>>> > > >>>>>>>>>>> > > initial encoding of the Variant type so we >>>>>>>>> were hoping to >>>>>>>>> > > >>>>>>>>>>> gain consensus on >>>>>>>>> > > >>>>>>>>>>> > > leaving that for a follow up spec. 
>>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>> > > >>>>>>>>>>> > > Thanks >>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>> > > >>>>>>>>>>> > > Aihua >>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>> > > >>>>>>>>>>> > > Meeting invite: >>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>> > > >>>>>>>>>>> > > Wednesday, July 17 · 9:00 – 10:00am >>>>>>>>> > > >>>>>>>>>>> > > Time zone: America/Los_Angeles >>>>>>>>> > > >>>>>>>>>>> > > Google Meet joining info >>>>>>>>> > > >>>>>>>>>>> > > Video call link: >>>>>>>>> https://meet.google.com/pbm-ovzn-aoq >>>>>>>>> > > >>>>>>>>>>> > > Or dial: (US) +1 650-449-9343 PIN: 170 576 >>>>>>>>> 525# >>>>>>>>> > > >>>>>>>>>>> > > More phone numbers: >>>>>>>>> > > >>>>>>>>>>> https://tel.meet/pbm-ovzn-aoq?pin=4079632691790 >>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>> > > >>>>>>>>>>> > > On Tue, May 28, 2024 at 9:21 PM Aihua Xu < >>>>>>>>> > > >>>>>>>>>>> aihua...@snowflake.com> wrote: >>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>> > > >>>>>>>>>>> > >> Hello, >>>>>>>>> > > >>>>>>>>>>> > >> >>>>>>>>> > > >>>>>>>>>>> > >> We have drafted the proposal >>>>>>>>> > > >>>>>>>>>>> > >> < >>>>>>>>> > > >>>>>>>>>>> >>>>>>>>> https://docs.google.com/document/d/1QjhpG_SVNPZh3anFcpicMQx90ebwjL7rmzFYfUP89Iw/edit >>>>>>>>> > > >>>>>>>>>>> > >>>>>>>>> > > >>>>>>>>>>> > >> for Variant data type. Please help review and >>>>>>>>> comment. >>>>>>>>> > > >>>>>>>>>>> > >> >>>>>>>>> > > >>>>>>>>>>> > >> Thanks, >>>>>>>>> > > >>>>>>>>>>> > >> Aihua >>>>>>>>> > > >>>>>>>>>>> > >> >>>>>>>>> > > >>>>>>>>>>> > >> On Thu, May 16, 2024 at 12:45 PM Jack Ye < >>>>>>>>> > > >>>>>>>>>>> yezhao...@gmail.com> wrote: >>>>>>>>> > > >>>>>>>>>>> > >> >>>>>>>>> > > >>>>>>>>>>> > >>> +10000 for a JSON/BSON type. 
We also had the >>>>>>>>> same >>>>>>>>> > > >>>>>>>>>>> discussion internally >>>>>>>>> > > >>>>>>>>>>> > >>> and a JSON type would really play well with >>>>>>>>> for example >>>>>>>>> > > >>>>>>>>>>> the SUPER type in >>>>>>>>> > > >>>>>>>>>>> > >>> Redshift: >>>>>>>>> > > >>>>>>>>>>> > >>> >>>>>>>>> > > >>>>>>>>>>> >>>>>>>>> https://docs.aws.amazon.com/redshift/latest/dg/r_SUPER_type.html, >>>>>>>>> > > >>>>>>>>>>> and >>>>>>>>> > > >>>>>>>>>>> > >>> can also provide better integration with the >>>>>>>>> Trino JSON >>>>>>>>> > > >>>>>>>>>>> type. >>>>>>>>> > > >>>>>>>>>>> > >>> >>>>>>>>> > > >>>>>>>>>>> > >>> Looking forward to the proposal! >>>>>>>>> > > >>>>>>>>>>> > >>> >>>>>>>>> > > >>>>>>>>>>> > >>> Best, >>>>>>>>> > > >>>>>>>>>>> > >>> Jack Ye >>>>>>>>> > > >>>>>>>>>>> > >>> >>>>>>>>> > > >>>>>>>>>>> > >>> >>>>>>>>> > > >>>>>>>>>>> > >>> On Wed, May 15, 2024 at 9:37 AM Tyler Akidau >>>>>>>>> > > >>>>>>>>>>> > >>> <tyler.aki...@snowflake.com.invalid> wrote: >>>>>>>>> > > >>>>>>>>>>> > >>> >>>>>>>>> > > >>>>>>>>>>> > >>>> On Tue, May 14, 2024 at 7:58 PM Gang Wu < >>>>>>>>> ust...@gmail.com> >>>>>>>>> > > >>>>>>>>>>> wrote: >>>>>>>>> > > >>>>>>>>>>> > >>>> >>>>>>>>> > > >>>>>>>>>>> > >>>>> > We may need some guidance on just how >>>>>>>>> many we need to >>>>>>>>> > > >>>>>>>>>>> look at; >>>>>>>>> > > >>>>>>>>>>> > >>>>> > we were planning on Spark and Trino, but >>>>>>>>> weren't sure >>>>>>>>> > > >>>>>>>>>>> how much >>>>>>>>> > > >>>>>>>>>>> > >>>>> > further down the rabbit hole we needed >>>>>>>>> to go。 >>>>>>>>> > > >>>>>>>>>>> > >>>>> >>>>>>>>> > > >>>>>>>>>>> > >>>>> There are some engines living outside the >>>>>>>>> Java world. It >>>>>>>>> > > >>>>>>>>>>> would be >>>>>>>>> > > >>>>>>>>>>> > >>>>> good if the proposal could cover the >>>>>>>>> effort it takes to >>>>>>>>> > > >>>>>>>>>>> integrate >>>>>>>>> > > >>>>>>>>>>> > >>>>> variant type to them (e.g. velox, >>>>>>>>> datafusion, etc.). 
>>>>>>>>> > > >>>>>>>>>>> This is something >>>>>>>>> > > >>>>>>>>>>> > >>>>> that >>>>>>>>> > > >>>>>>>>>>> > >>>>> some proprietary iceberg vendors also care >>>>>>>>> about. >>>>>>>>> > > >>>>>>>>>>> > >>>>> >>>>>>>>> > > >>>>>>>>>>> > >>>> >>>>>>>>> > > >>>>>>>>>>> > >>>> Ack, makes sense. We can make sure to share >>>>>>>>> some >>>>>>>>> > > >>>>>>>>>>> perspective on this. >>>>>>>>> > > >>>>>>>>>>> > >>>> >>>>>>>>> > > >>>>>>>>>>> > >>>> > Not necessarily, no. As long as there's a >>>>>>>>> binary type >>>>>>>>> > > >>>>>>>>>>> and Iceberg and >>>>>>>>> > > >>>>>>>>>>> > >>>>> > the query engines are aware that the >>>>>>>>> binary column >>>>>>>>> > > >>>>>>>>>>> needs to be >>>>>>>>> > > >>>>>>>>>>> > >>>>> > interpreted as a variant, that should be >>>>>>>>> sufficient. >>>>>>>>> > > >>>>>>>>>>> > >>>>> >>>>>>>>> > > >>>>>>>>>>> > >>>>> From the perspective of interoperability, >>>>>>>>> it would be >>>>>>>>> > > >>>>>>>>>>> good to support >>>>>>>>> > > >>>>>>>>>>> > >>>>> native >>>>>>>>> > > >>>>>>>>>>> > >>>>> type from file specs. Life will be easier >>>>>>>>> for projects >>>>>>>>> > > >>>>>>>>>>> like Apache >>>>>>>>> > > >>>>>>>>>>> > >>>>> XTable. >>>>>>>>> > > >>>>>>>>>>> > >>>>> File format could also provide >>>>>>>>> finer-grained statistics >>>>>>>>> > > >>>>>>>>>>> for variant >>>>>>>>> > > >>>>>>>>>>> > >>>>> type which >>>>>>>>> > > >>>>>>>>>>> > >>>>> facilitates data skipping. >>>>>>>>> > > >>>>>>>>>>> > >>>>> >>>>>>>>> > > >>>>>>>>>>> > >>>> >>>>>>>>> > > >>>>>>>>>>> > >>>> Agreed, there can definitely be additional >>>>>>>>> value in >>>>>>>>> > > >>>>>>>>>>> native file format >>>>>>>>> > > >>>>>>>>>>> > >>>> integration. Just wanted to highlight that >>>>>>>>> it's not a >>>>>>>>> > > >>>>>>>>>>> strict requirement. 
>>>>>>>>> > > >>>>>>>>>>> > >>>> >>>>>>>>> > > >>>>>>>>>>> > >>>> -Tyler >>>>>>>>> > > >>>>>>>>>>> > >>>> >>>>>>>>> > > >>>>>>>>>>> > >>>> >>>>>>>>> > > >>>>>>>>>>> > >>>>> >>>>>>>>> > > >>>>>>>>>>> > >>>>> Gang >>>>>>>>> > > >>>>>>>>>>> > >>>>> >>>>>>>>> > > >>>>>>>>>>> > >>>>> On Wed, May 15, 2024 at 6:49 AM Tyler >>>>>>>>> Akidau >>>>>>>>> > > >>>>>>>>>>> > >>>>> <tyler.aki...@snowflake.com.invalid> >>>>>>>>> wrote: >>>>>>>>> > > >>>>>>>>>>> > >>>>> >>>>>>>>> > > >>>>>>>>>>> > >>>>>> Good to see you again as well, JB! Thanks! >>>>>>>>> > > >>>>>>>>>>> > >>>>>> >>>>>>>>> > > >>>>>>>>>>> > >>>>>> -Tyler >>>>>>>>> > > >>>>>>>>>>> > >>>>>> >>>>>>>>> > > >>>>>>>>>>> > >>>>>> >>>>>>>>> > > >>>>>>>>>>> > >>>>>> On Tue, May 14, 2024 at 1:04 PM >>>>>>>>> Jean-Baptiste Onofré < >>>>>>>>> > > >>>>>>>>>>> j...@nanthrax.net> >>>>>>>>> > > >>>>>>>>>>> > >>>>>> wrote: >>>>>>>>> > > >>>>>>>>>>> > >>>>>> >>>>>>>>> > > >>>>>>>>>>> > >>>>>>> Hi Tyler, >>>>>>>>> > > >>>>>>>>>>> > >>>>>>> >>>>>>>>> > > >>>>>>>>>>> > >>>>>>> Super happy to see you there :) It >>>>>>>>> reminds me our >>>>>>>>> > > >>>>>>>>>>> discussions back in >>>>>>>>> > > >>>>>>>>>>> > >>>>>>> the start of Apache Beam :) >>>>>>>>> > > >>>>>>>>>>> > >>>>>>> >>>>>>>>> > > >>>>>>>>>>> > >>>>>>> Anyway, the thread is pretty >>>>>>>>> interesting. I remember >>>>>>>>> > > >>>>>>>>>>> some discussions >>>>>>>>> > > >>>>>>>>>>> > >>>>>>> about JSON datatype for spec v3. The >>>>>>>>> binary data type >>>>>>>>> > > >>>>>>>>>>> is already >>>>>>>>> > > >>>>>>>>>>> > >>>>>>> supported in the spec v2. >>>>>>>>> > > >>>>>>>>>>> > >>>>>>> >>>>>>>>> > > >>>>>>>>>>> > >>>>>>> I'm looking forward to the proposal and >>>>>>>>> happy to help >>>>>>>>> > > >>>>>>>>>>> on this ! 
>>>>>>>>> > > >>>>>>>>>>> > >>>>>>> >>>>>>>>> > > >>>>>>>>>>> > >>>>>>> Regards >>>>>>>>> > > >>>>>>>>>>> > >>>>>>> JB >>>>>>>>> > > >>>>>>>>>>> > >>>>>>> >>>>>>>>> > > >>>>>>>>>>> > >>>>>>> On Sat, May 11, 2024 at 7:06 AM Tyler >>>>>>>>> Akidau >>>>>>>>> > > >>>>>>>>>>> > >>>>>>> <tyler.aki...@snowflake.com.invalid> >>>>>>>>> wrote: >>>>>>>>> > > >>>>>>>>>>> > >>>>>>> > >>>>>>>>> > > >>>>>>>>>>> > >>>>>>> > Hello, >>>>>>>>> > > >>>>>>>>>>> > >>>>>>> > >>>>>>>>> > > >>>>>>>>>>> > >>>>>>> > We (Tyler, Nileema, Selcuk, Aihua) are >>>>>>>>> working on a >>>>>>>>> > > >>>>>>>>>>> proposal for >>>>>>>>> > > >>>>>>>>>>> > >>>>>>> which we’d like to get early feedback >>>>>>>>> from the >>>>>>>>> > > >>>>>>>>>>> community. As you may know, >>>>>>>>> > > >>>>>>>>>>> > >>>>>>> Snowflake has embraced Iceberg as its >>>>>>>>> open Data Lake >>>>>>>>> > > >>>>>>>>>>> format. Having made >>>>>>>>> > > >>>>>>>>>>> > >>>>>>> good progress on our own adoption of the >>>>>>>>> Iceberg >>>>>>>>> > > >>>>>>>>>>> standard, we’re now in a >>>>>>>>> > > >>>>>>>>>>> > >>>>>>> position where there are features not >>>>>>>>> yet supported in >>>>>>>>> > > >>>>>>>>>>> Iceberg which we >>>>>>>>> > > >>>>>>>>>>> > >>>>>>> think would be valuable for our users, >>>>>>>>> and that we >>>>>>>>> > > >>>>>>>>>>> would like to discuss >>>>>>>>> > > >>>>>>>>>>> > >>>>>>> with and help contribute to the Iceberg >>>>>>>>> community. >>>>>>>>> > > >>>>>>>>>>> > >>>>>>> > >>>>>>>>> > > >>>>>>>>>>> > >>>>>>> > The first two such features we’d like >>>>>>>>> to discuss are >>>>>>>>> > > >>>>>>>>>>> in support of >>>>>>>>> > > >>>>>>>>>>> > >>>>>>> efficient querying of dynamically typed, >>>>>>>>> > > >>>>>>>>>>> semi-structured data: variant data >>>>>>>>> > > >>>>>>>>>>> > >>>>>>> types, and subcolumnarization of variant >>>>>>>>> columns. 
>>>>> In more detail, for anyone who may not already be familiar:
>>>>>
>>>>> 1. Variant data types
>>>>> Variant types allow for the efficient binary encoding of dynamic semi-structured data such as JSON, Avro, etc. By encoding semi-structured data as a variant column, we retain the flexibility of the source data, while allowing query engines to more efficiently operate on the data. Snowflake has supported the variant data type on Snowflake tables for many years [1]. As more and more users utilize Iceberg tables in Snowflake, we're hearing an increasing chorus of requests for variant support. Additionally, other query engines such as Apache Spark have begun adding variant support [2]. As such, we believe it would be beneficial to the Iceberg community as a whole to standardize on the variant data type encoding used across Iceberg tables.
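[Editor's note: the core idea behind a variant encoding, as described above, can be sketched in a few lines. This is a toy illustration only, not the actual Spark or Snowflake variant specification: field names are collected once into a shared metadata dictionary, and values are written in a compact tagged binary form that references names by index, so engines need not re-parse JSON text per row.]

```python
import json
import struct

def encode_variant(value):
    """Toy variant-style encoder (illustrative only; NOT the Spark variant
    spec). Returns (metadata, value_bytes): metadata holds the deduplicated
    field-name dictionary; value_bytes is a tagged binary encoding that
    refers to field names by their index in that dictionary."""
    names = []

    def collect(v):
        # First pass: gather every distinct field name into the dictionary.
        if isinstance(v, dict):
            for k, sub in v.items():
                if k not in names:
                    names.append(k)
                collect(sub)
        elif isinstance(v, list):
            for sub in v:
                collect(sub)

    collect(value)
    name_index = {n: i for i, n in enumerate(names)}

    def encode(v):
        if isinstance(v, dict):
            # object: tag 0x00, field count, then (name-id, encoded value) pairs
            out = b"\x00" + struct.pack("<I", len(v))
            for k, sub in v.items():
                out += struct.pack("<I", name_index[k]) + encode(sub)
            return out
        if isinstance(v, list):
            # array: tag 0x01, element count, then encoded elements
            out = b"\x01" + struct.pack("<I", len(v))
            for sub in v:
                out += encode(sub)
            return out
        if isinstance(v, int):
            return b"\x02" + struct.pack("<q", v)   # int64: tag 0x02
        if isinstance(v, str):
            raw = v.encode("utf-8")
            return b"\x03" + struct.pack("<I", len(raw)) + raw  # string: tag 0x03
        raise TypeError(f"unsupported type: {type(v)}")

    metadata = json.dumps(names).encode("utf-8")
    return metadata, encode(value)
```

Note the key design property this sketch shares with real variant encodings: repeated field names across nested objects are stored once in the metadata, so wide or deeply repetitive documents stay compact and field lookup becomes an integer comparison.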
>>>>> One specific point to make here is that, since an Apache OSS version of variant encoding already exists in Spark, it likely makes sense to simply adopt the Spark encoding as the Iceberg standard as well. The encoding we use internally today in Snowflake is slightly different, but essentially equivalent, and we see no particular value in trying to clutter the space with another equivalent-but-incompatible encoding.
>>>>>
>>>>> 2. Subcolumnarization
>>>>> Subcolumnarization of variant columns allows query engines to efficiently prune datasets when subcolumns (i.e., nested fields) within a variant column are queried, and also allows optionally materializing some of the nested fields as a column on their own, affording queries on these subcolumns the ability to read less data and spend less CPU on extraction. When subcolumnarizing, the system managing table metadata and data tracks individual pruning statistics (min, max, null, etc.)
>>>>> for some subset of the nested fields within a variant, and also manages any optional materialization. Without subcolumnarization, any query which touches a variant column must read, parse, extract, and filter every row for which that column is non-null. Thus, by providing a standardized way of tracking subcolumn metadata and data for variant columns, Iceberg can make subcolumnar optimizations accessible across various catalogs and query engines.
>>>>>
>>>>> Subcolumnarization is a non-trivial topic, so we expect any concrete proposal to include not only the set of changes to Iceberg metadata that allow compatible query engines to interoperate on subcolumnarization data for variant columns, but also reference documentation explaining subcolumnarization principles and recommended best practices.
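[Editor's note: the pruning mechanism described above can be sketched concretely. This is an illustration of the general technique, not Iceberg's actual metadata layout: a writer records min/max/null statistics for selected nested paths within each data file, and a scan skips any file whose recorded range cannot satisfy the predicate.]

```python
def collect_stats(rows, paths):
    """Compute per-subcolumn pruning statistics (min, max, null count) for
    the given dotted paths across one data file's rows. Hypothetical helper
    for illustration; path names and layout are this sketch's own."""
    stats = {p: {"min": None, "max": None, "nulls": 0} for p in paths}
    for row in rows:
        for p in paths:
            v = row
            for part in p.split("."):
                v = v.get(part) if isinstance(v, dict) else None
            if v is None:
                stats[p]["nulls"] += 1  # missing subcolumn counts as null
            else:
                s = stats[p]
                s["min"] = v if s["min"] is None else min(s["min"], v)
                s["max"] = v if s["max"] is None else max(s["max"], v)
    return stats

def can_skip_file(stats, path, predicate_value):
    """A file can be skipped for the predicate `path == predicate_value`
    when the value falls outside the file's recorded [min, max] range."""
    s = stats.get(path)
    if s is None or s["min"] is None:
        return False  # no stats recorded: the file must be read
    return predicate_value < s["min"] or predicate_value > s["max"]
```

The payoff is exactly the one the proposal describes: with stats like these tracked per file, a query filtering on a nested field inside a variant can discard whole files from the scan plan instead of reading, parsing, and filtering every non-null variant row.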
>>>>> It sounds like the recent Geo proposal [3] may be a good starting point for how to approach this, so our plan is to write something up in that vein that covers the proposed spec changes, backwards compatibility, implementor burdens, etc. But we wanted to first reach out to the community to introduce ourselves and the idea, and see if there's any early feedback we should incorporate before we spend too much time on a concrete proposal.
>>>>>
>>>>> Thank you!
>>>>> [1] https://docs.snowflake.com/en/sql-reference/data-types-semistructured
>>>>> [2] https://github.com/apache/spark/blob/master/common/variant/README.md
>>>>> [3] https://docs.google.com/document/d/1iVFbrRNEzZl8tDcZC81GFt01QJkLJsI9E2NBOt21IRI/edit
>>>>>
>>>>> -Tyler, Nileema, Selcuk, Aihua

--
Ryan Blue
Databricks