Thanks for the discussion and feedback. Do we have consensus on points 1 and 3 to move forward with the Spark variant encoding and support the Variant type only? Otherwise, let me know how to proceed from here.
Regarding point 2, I also feel Iceberg is the more natural home for such a subproject for the variant spec and implementation. But let me reach out to the Spark community to discuss.

Thanks,
Aihua

On Fri, Jul 19, 2024 at 9:35 AM Yufei Gu <flyrain...@gmail.com> wrote:

> Agreed with point 1.
>
> For point 2, I also prefer to hold the spec and reference implementation under Iceberg. Here are the reasons:
> 1. It is unconventional and impractical for one engine to depend on another for data types. For instance, it is not ideal for Trino to rely on data types defined by the Spark engine.
> 2. Iceberg serves as a bridge between engines and file formats. By centralizing the specification in Iceberg, any future optimizations or updates to file formats can be referred to within Iceberg, ensuring consistency and reducing dependencies.
>
> For point 3, I'd prefer to support the variant type only at this moment.
>
> Yufei
>
> On Thu, Jul 18, 2024 at 12:55 PM Ryan Blue <b...@databricks.com.invalid> wrote:
>
>> Similarly, I'm aligned with point 1 and I'd choose to support only variant for point 3.
>>
>> We'll need to work with the Spark community to find a good place for the library and spec, since it touches many different projects. I'd also prefer Iceberg as the home.
>>
>> I also think it's a good idea to get subcolumnarization into our spec when we update. Without that I think the feature will be fairly limited.
>>
>> On Thu, Jul 18, 2024 at 10:56 AM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>
>>> I'm aligned with point 1.
>>>
>>> For point 2, I think we should choose quickly. I honestly do think this would be fine as part of the Iceberg spec directly, but I understand it may be better for the broader community if it was a subproject. As a subproject, I would still prefer it being an Iceberg subproject since we are engine/file-format agnostic.
>>>
>>> 3. I support adding just Variant.
>>>
>>> On Thu, Jul 18, 2024 at 12:54 AM Aihua Xu <aihu...@apache.org> wrote:
>>>
>>>> Hello community,
>>>>
>>>> It's great to sync up with some of you on Variant and Subcolumnarization support in Iceberg again. Apologies that I didn't record the meeting, but here are some key items that we want to follow up on with the community.
>>>>
>>>> 1. Adopt Spark Variant encoding
>>>> Those present were in favor of adopting the Spark variant encoding for Iceberg Variant, with extensions to support other Iceberg types. We would like to know if anyone has an objection to reusing this open source encoding.
>>>>
>>>> 2. Movement of the Spark Variant spec to another project
>>>> To avoid introducing Apache Spark as a dependency for the engines and file formats, we discussed separating the Spark Variant encoding spec and implementation from the Spark project to a neutral location. We thought up several solutions but didn't have consensus on any of them. We are looking for more feedback on this topic from the community, either in terms of support for one of these options or another idea on how to support the spec.
>>>>
>>>> Options proposed:
>>>> * Leave the spec in Spark (difficult for versioning and for other engines)
>>>> * Copy the spec into the Iceberg project directly (difficult for other table formats)
>>>> * Create a sub-project of Apache Iceberg and move the spec and reference implementation there (logistically complicated)
>>>> * Create a sub-project of Apache Spark and move the spec and reference implementation there (logistically complicated)
>>>>
>>>> 3. Add Variant type vs. Variant and JSON types
>>>> Those who were present were in favor of adding only the Variant type to Iceberg. We are looking for anyone who has an objection to going forward with just the Variant type and no Iceberg JSON type. We favored adding the Variant type only because:
>>>> * Introducing a JSON type would require engines that only support VARIANT to do write-time validation of their input to a JSON column; an engine without a JSON type couldn't support this.
>>>> * Engines which don't support Variant will work most of the time, and the spec can define fallback strings for reading unsupported types. Writing JSON into a Variant will always work.
>>>>
>>>> 4. Support for a Subcolumnarization spec (shredding in Spark)
>>>> We have no action items on this but would like to follow up on discussions of Subcolumnarization in the future.
>>>> * We had general agreement that this should be included in Iceberg V3, or else adding Variant may not be useful.
>>>> * We are also interested in adopting the shredding spec from Spark and would like to move it to whatever place we decide the Variant spec is going to live.
>>>>
>>>> Let us know if we missed anything or if you have any additional thoughts or suggestions.
>>>>
>>>> Thanks
>>>> Aihua
>>>>
>>>> On 2024/07/15 18:32:22 Aihua Xu wrote:
>>>> > Thanks for the discussion.
>>>> >
>>>> > I will move forward to work on the spec PR.
>>>> >
>>>> > Regarding the implementation, we will have a module for Variant support in Iceberg, so we will not have to bring in Spark libraries.
>>>> >
>>>> > I'm reposting the meeting invite in case it wasn't clear in my original email, since I included it at the end. Looks like we don't have major objections/divergences, but let's sync up and reach consensus.
>>>> >
>>>> > Meeting invite:
>>>> >
>>>> > Wednesday, July 17 · 9:00 – 10:00am
>>>> > Time zone: America/Los_Angeles
>>>> > Google Meet joining info
>>>> > Video call link: https://meet.google.com/pbm-ovzn-aoq
>>>> > Or dial: (US) +1 650-449-9343 PIN: 170 576 525#
>>>> > More phone numbers: https://tel.meet/pbm-ovzn-aoq?pin=4079632691790
>>>> >
>>>> > Thanks,
>>>> > Aihua
>>>> >
>>>> > On 2024/07/12 20:55:01 Micah Kornfield wrote:
>>>> > > I don't think this needs to hold up the PR, but I think coming to a consensus on the exact set of supported types is worthwhile (and whether the goal is to maintain the same set as specified by the Spark Variant type, or whether divergence is expected/allowed). From a fragmentation perspective it would be a shame if they diverge, so maybe a next step is also suggesting support to the Spark community for the missing existing Iceberg types?
>>>> > >
>>>> > > Thanks,
>>>> > > Micah
>>>> > >
>>>> > > On Fri, Jul 12, 2024 at 1:44 PM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>>> > >
>>>> > > > Just talked with Aihua and he's working on the Spec PR now.
>>>> > > > We can get feedback there from everyone.
>>>> > > >
>>>> > > > On Fri, Jul 12, 2024 at 3:41 PM Ryan Blue <b...@databricks.com.invalid> wrote:
>>>> > > >
>>>> > > >> Good idea, but I'm hoping that we can continue to get their feedback in parallel to getting the spec changes started. Piotr didn't seem to object to the encoding from what I read of his comments. Hopefully he (and others) chime in here.
>>>> > > >>
>>>> > > >> On Fri, Jul 12, 2024 at 1:32 PM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>>> > > >>
>>>> > > >>> I just want to make sure we get Piotr and Peter on board as representatives of the Flink and Trino engines. Also, make sure we have anyone else chime in who has experience with Ray, if possible.
>>>> > > >>>
>>>> > > >>> Spec changes feel like the right next step.
>>>> > > >>>
>>>> > > >>> On Fri, Jul 12, 2024 at 3:14 PM Ryan Blue <b...@databricks.com.invalid> wrote:
>>>> > > >>>
>>>> > > >>>> Okay, what are the next steps here? This proposal has been out for quite a while and I don't see any major objections to using the Spark encoding. It's quite well designed and fits the need well. It can also be extended to support additional types that are missing if that's a priority.
>>>> > > >>>>
>>>> > > >>>> Should we move forward by starting a draft of the changes to the table spec? Then we can vote on committing those changes and get moving on an implementation (or possibly do the implementation in parallel).
>>>> > > >>>>
>>>> > > >>>> On Fri, Jul 12, 2024 at 1:08 PM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>>> > > >>>>
>>>> > > >>>>> That's fair, I'm sold on an Iceberg module.
>>>> > > >>>>>
>>>> > > >>>>> On Fri, Jul 12, 2024 at 2:53 PM Ryan Blue <b...@databricks.com.invalid> wrote:
>>>> > > >>>>>
>>>> > > >>>>>> > Feels like eventually the encoding should land in parquet proper right?
>>>> > > >>>>>>
>>>> > > >>>>>> What about using it in ORC? I don't know where it should end up. Maybe Iceberg should make a standalone module from it?
>>>> > > >>>>>>
>>>> > > >>>>>> On Fri, Jul 12, 2024 at 12:38 PM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>>> > > >>>>>>
>>>> > > >>>>>>> Feels like eventually the encoding should land in parquet proper right? I'm fine with us just copying it into Iceberg for the time being, though.
>>>> > > >>>>>>>
>>>> > > >>>>>>> On Fri, Jul 12, 2024 at 2:31 PM Ryan Blue <b...@databricks.com.invalid> wrote:
>>>> > > >>>>>>>
>>>> > > >>>>>>>> Oops, it looks like I missed where Aihua brought this up in his last email:
>>>> > > >>>>>>>>
>>>> > > >>>>>>>> > do we have an issue to directly use Spark implementation in Iceberg?
>>>> > > >>>>>>>>
>>>> > > >>>>>>>> Yes, I think that we do have an issue using the Spark library. What do you think about a Java implementation in Iceberg?
>>>> > > >>>>>>>>
>>>> > > >>>>>>>> Ryan
>>>> > > >>>>>>>>
>>>> > > >>>>>>>> On Fri, Jul 12, 2024 at 12:28 PM Ryan Blue <b...@databricks.com> wrote:
>>>> > > >>>>>>>>
>>>> > > >>>>>>>>> I raised the same point from Peter's email in a comment on the doc as well. There is a spark-variant_2.13 artifact that would be a much smaller scope than relying on large portions of Spark, but even then I doubt that it is a good idea for Iceberg to depend on it, because it is a Scala artifact and we would need to bring in a ton of Scala libs. I think what makes the most sense is to have an independent implementation of the spec in Iceberg.
>>>> > > >>>>>>>>>
>>>> > > >>>>>>>>> On Fri, Jul 12, 2024 at 11:51 AM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>>> > > >>>>>>>>>
>>>> > > >>>>>>>>>> Hi Aihua,
>>>> > > >>>>>>>>>> Long time no see :)
>>>> > > >>>>>>>>>> Would this mean that every engine which plans to support the Variant data type needs to add Spark as a dependency? Like Flink/Trino/Hive, etc.?
>>>> > > >>>>>>>>>> Thanks, Peter
>>>> > > >>>>>>>>>>
>>>> > > >>>>>>>>>> On Fri, Jul 12, 2024, 19:10 Aihua Xu <aihu...@apache.org> wrote:
>>>> > > >>>>>>>>>>
>>>> > > >>>>>>>>>>> Thanks Ryan.
>>>> > > >>>>>>>>>>>
>>>> > > >>>>>>>>>>> Yeah. That's another reason we want to pursue the Spark encoding: to keep compatibility for the open source engines.
>>>> > > >>>>>>>>>>>
>>>> > > >>>>>>>>>>> One more question regarding the encoding implementation: do we have an issue to directly use Spark implementation in Iceberg? Russell pointed out that Trino doesn't have a Spark dependency and that could be a problem?
>>>> > > >>>>>>>>>>>
>>>> > > >>>>>>>>>>> Thanks,
>>>> > > >>>>>>>>>>> Aihua
>>>> > > >>>>>>>>>>>
>>>> > > >>>>>>>>>>> On 2024/07/12 15:02:06 Ryan Blue wrote:
>>>> > > >>>>>>>>>>> > Thanks, Aihua!
>>>> > > >>>>>>>>>>> >
>>>> > > >>>>>>>>>>> > I think that the encoding choice in the current doc is a good one. I went through the Spark encoding in detail and it looks like a better choice than the other candidate encodings for quickly accessing nested fields.
>>>> > > >>>>>>>>>>> >
>>>> > > >>>>>>>>>>> > Another reason to use the Spark type is that this is what Delta's variant type is based on, so Parquet files in tables written by Delta could be converted or used in Iceberg tables without needing to rewrite variant data. (Also, note that I work at Databricks and have an interest in increasing format compatibility.)
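
For context on the "quickly accessing nested fields" point above: in the Spark encoding [2], every variant value starts with a one-byte header whose low two bits select the basic type (primitive, short string, object, or array), and object fields are stored with sorted field IDs so a reader can binary-search for a key instead of scanning the whole value. A minimal Java sketch of the header decode follows; the class and method names are illustrative, not from any existing library, and the layout should be checked against the Spark spec before relying on it.

    // Sketch: inspecting a Spark-encoded variant value header [2].
    // Illustrative names; not an existing library API.
    import java.nio.charset.StandardCharsets;

    public final class VariantValueHeader {
      // Basic types encoded in the low 2 bits of the first value byte.
      static final int PRIMITIVE = 0;    // null, booleans, ints, double, decimal, ...
      static final int SHORT_STRING = 1; // inline string of up to 63 bytes
      static final int OBJECT = 2;       // sorted field ids enable binary search
      static final int ARRAY = 3;

      static int basicType(byte[] value) {
        return value[0] & 0x03;
      }

      static int typeInfo(byte[] value) {
        return (value[0] >> 2) & 0x3F;   // meaning depends on the basic type
      }

      static String readShortString(byte[] value) {
        int len = typeInfo(value);       // for short strings, type info is the length
        return new String(value, 1, len, StandardCharsets.UTF_8);
      }
    }

Because an object's field IDs are kept sorted by the key strings they reference in the metadata dictionary, finding a nested field is a binary search plus one offset jump rather than a parse of the whole document, which is the access-speed property being described here.
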
>>>> > > >>>>>>>>>>> >
>>>> > > >>>>>>>>>>> > Ryan
>>>> > > >>>>>>>>>>> >
>>>> > > >>>>>>>>>>> > On Thu, Jul 11, 2024 at 11:21 AM Aihua Xu <aihua...@snowflake.com.invalid> wrote:
>>>> > > >>>>>>>>>>> >
>>>> > > >>>>>>>>>>> > > [Discuss] Consensus for Variant Encoding
>>>> > > >>>>>>>>>>> > >
>>>> > > >>>>>>>>>>> > > It was great to be able to present the Variant type proposal in the community sync yesterday, and I'm looking to host a meeting next week (targeting 9am on July 17th) to go over any further concerns about the encoding of the Variant type and any other questions on the first phase of the proposal <https://docs.google.com/document/d/1QjhpG_SVNPZh3anFcpicMQx90ebwjL7rmzFYfUP89Iw/edit>. We are hoping that anyone who is interested in the proposal can either join or reply with their comments so we can discuss them. A summary of the discussion and notes will be sent to the mailing list for further comment there.
>>>> > > >>>>>>>>>>> > >
>>>> > > >>>>>>>>>>> > > - What should be the underlying binary representation?
>>>> > > >>>>>>>>>>> > >
>>>> > > >>>>>>>>>>> > > We have evaluated a few encodings in the doc, including ION, JSONB, and the Spark encoding. Choosing the underlying encoding is an important first step here, and we believe we have general support for Spark's Variant encoding. We would like to hear if anyone else has strong opinions in this space.
>>>> > > >>>>>>>>>>> > >
>>>> > > >>>>>>>>>>> > > - Should we support multiple logical types or just Variant? Variant vs. Variant + JSON.
>>>> > > >>>>>>>>>>> > >
>>>> > > >>>>>>>>>>> > > This is to discuss what logical data type(s) should be supported in Iceberg - Variant only vs. Variant + JSON. Both types would share the same underlying encoding but would imply different limitations on engines working with those types.
>>>> > > >>>>>>>>>>> > >
>>>> > > >>>>>>>>>>> > > From the sync-up meeting, we lean toward supporting Variant only, and we want to reach consensus on the supported type(s).
>>>> > > >>>>>>>>>>> > >
>>>> > > >>>>>>>>>>> > > - How should we move forward with Subcolumnarization?
>>>> > > >>>>>>>>>>> > >
>>>> > > >>>>>>>>>>> > > Subcolumnarization is an optimization for the Variant type that separates out subcolumns with their own metadata. This is not critical for choosing the initial encoding of the Variant type, so we were hoping to gain consensus on leaving that for a follow-up spec.
>>>> > > >>>>>>>>>>> > >
>>>> > > >>>>>>>>>>> > > Thanks
>>>> > > >>>>>>>>>>> > > Aihua
>>>> > > >>>>>>>>>>> > >
>>>> > > >>>>>>>>>>> > > Meeting invite:
>>>> > > >>>>>>>>>>> > > Wednesday, July 17 · 9:00 – 10:00am
>>>> > > >>>>>>>>>>> > > Time zone: America/Los_Angeles
>>>> > > >>>>>>>>>>> > > Google Meet joining info
>>>> > > >>>>>>>>>>> > > Video call link: https://meet.google.com/pbm-ovzn-aoq
>>>> > > >>>>>>>>>>> > > Or dial: (US) +1 650-449-9343 PIN: 170 576 525#
>>>> > > >>>>>>>>>>> > > More phone numbers: https://tel.meet/pbm-ovzn-aoq?pin=4079632691790
>>>> > > >>>>>>>>>>> > >
>>>> > > >>>>>>>>>>> > > On Tue, May 28, 2024 at 9:21 PM Aihua Xu <aihua...@snowflake.com> wrote:
>>>> > > >>>>>>>>>>> > >
>>>> > > >>>>>>>>>>> > >> Hello,
>>>> > > >>>>>>>>>>> > >>
>>>> > > >>>>>>>>>>> > >> We have drafted the proposal <https://docs.google.com/document/d/1QjhpG_SVNPZh3anFcpicMQx90ebwjL7rmzFYfUP89Iw/edit> for the Variant data type. Please help review and comment.
>>>> > > >>>>>>>>>>> > >>
>>>> > > >>>>>>>>>>> > >> Thanks,
>>>> > > >>>>>>>>>>> > >> Aihua
>>>> > > >>>>>>>>>>> > >>
>>>> > > >>>>>>>>>>> > >> On Thu, May 16, 2024 at 12:45 PM Jack Ye <yezhao...@gmail.com> wrote:
>>>> > > >>>>>>>>>>> > >>
>>>> > > >>>>>>>>>>> > >>> +10000 for a JSON/BSON type. We also had the same discussion internally, and a JSON type would play really well with, for example, the SUPER type in Redshift: https://docs.aws.amazon.com/redshift/latest/dg/r_SUPER_type.html, and can also provide better integration with the Trino JSON type.
>>>> > > >>>>>>>>>>> > >>>
>>>> > > >>>>>>>>>>> > >>> Looking forward to the proposal!
>>>> > > >>>>>>>>>>> > >>>
>>>> > > >>>>>>>>>>> > >>> Best,
>>>> > > >>>>>>>>>>> > >>> Jack Ye
>>>> > > >>>>>>>>>>> > >>>
>>>> > > >>>>>>>>>>> > >>> On Wed, May 15, 2024 at 9:37 AM Tyler Akidau <tyler.aki...@snowflake.com.invalid> wrote:
>>>> > > >>>>>>>>>>> > >>>
>>>> > > >>>>>>>>>>> > >>>> On Tue, May 14, 2024 at 7:58 PM Gang Wu <ust...@gmail.com> wrote:
>>>> > > >>>>>>>>>>> > >>>>
>>>> > > >>>>>>>>>>> > >>>>> > We may need some guidance on just how many we need to look at; we were planning on Spark and Trino, but weren't sure how much further down the rabbit hole we needed to go.
>>>> > > >>>>>>>>>>> > >>>>>
>>>> > > >>>>>>>>>>> > >>>>> There are some engines living outside the Java world. It would be good if the proposal could cover the effort it takes to integrate the variant type into them (e.g. velox, datafusion, etc.). This is something that some proprietary Iceberg vendors also care about.
>>>> > > >>>>>>>>>>> > >>>>
>>>> > > >>>>>>>>>>> > >>>> Ack, makes sense. We can make sure to share some perspective on this.
>>>> > > >>>>>>>>>>> > >>>>
>>>> > > >>>>>>>>>>> > >>>>> > Not necessarily, no. As long as there's a binary type and Iceberg and the query engines are aware that the binary column needs to be interpreted as a variant, that should be sufficient.
>>>> > > >>>>>>>>>>> > >>>>>
>>>> > > >>>>>>>>>>> > >>>>> From the perspective of interoperability, it would be good to support a native type in the file specs. Life will be easier for projects like Apache XTable. The file format could also provide finer-grained statistics for the variant type, which facilitates data skipping.
>>>> > > >>>>>>>>>>> > >>>>
>>>> > > >>>>>>>>>>> > >>>> Agreed, there can definitely be additional value in native file format integration. Just wanted to highlight that it's not a strict requirement.
>>>> > > >>>>>>>>>>> > >>>>
>>>> > > >>>>>>>>>>> > >>>> -Tyler
>>>> > > >>>>>>>>>>> > >>>>
>>>> > > >>>>>>>>>>> > >>>>> Gang
>>>> > > >>>>>>>>>>> > >>>>>
>>>> > > >>>>>>>>>>> > >>>>> On Wed, May 15, 2024 at 6:49 AM Tyler Akidau <tyler.aki...@snowflake.com.invalid> wrote:
>>>> > > >>>>>>>>>>> > >>>>>
>>>> > > >>>>>>>>>>> > >>>>>> Good to see you again as well, JB! Thanks!
>>>> > > >>>>>>>>>>> > >>>>>>
>>>> > > >>>>>>>>>>> > >>>>>> -Tyler
>>>> > > >>>>>>>>>>> > >>>>>>
>>>> > > >>>>>>>>>>> > >>>>>> On Tue, May 14, 2024 at 1:04 PM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>>>> > > >>>>>>>>>>> > >>>>>>
>>>> > > >>>>>>>>>>> > >>>>>>> Hi Tyler,
>>>> > > >>>>>>>>>>> > >>>>>>>
>>>> > > >>>>>>>>>>> > >>>>>>> Super happy to see you there :) It reminds me of our discussions back at the start of Apache Beam :)
>>>> > > >>>>>>>>>>> > >>>>>>>
>>>> > > >>>>>>>>>>> > >>>>>>> Anyway, the thread is pretty interesting. I remember some discussions about a JSON data type for spec v3. The binary data type is already supported in spec v2.
>>>> > > >>>>>>>>>>> > >>>>>>>
>>>> > > >>>>>>>>>>> > >>>>>>> I'm looking forward to the proposal and happy to help on this!
>>>> > > >>>>>>>>>>> > >>>>>>>
>>>> > > >>>>>>>>>>> > >>>>>>> Regards
>>>> > > >>>>>>>>>>> > >>>>>>> JB
>>>> > > >>>>>>>>>>> > >>>>>>>
>>>> > > >>>>>>>>>>> > >>>>>>> On Sat, May 11, 2024 at 7:06 AM Tyler Akidau <tyler.aki...@snowflake.com.invalid> wrote:
>>>> > > >>>>>>>>>>> > >>>>>>> >
>>>> > > >>>>>>>>>>> > >>>>>>> > Hello,
>>>> > > >>>>>>>>>>> > >>>>>>> >
>>>> > > >>>>>>>>>>> > >>>>>>> > We (Tyler, Nileema, Selcuk, Aihua) are working on a proposal for which we'd like to get early feedback from the community. As you may know, Snowflake has embraced Iceberg as its open Data Lake format. Having made good progress on our own adoption of the Iceberg standard, we're now in a position where there are features not yet supported in Iceberg which we think would be valuable for our users, and that we would like to discuss with and help contribute to the Iceberg community.
>>>> > > >>>>>>>>>>> > >>>>>>> >
>>>> > > >>>>>>>>>>> > >>>>>>> > The first two such features we'd like to discuss are in support of efficient querying of dynamically typed, semi-structured data: variant data types, and subcolumnarization of variant columns. In more detail, for anyone who may not already be familiar:
>>>> > > >>>>>>>>>>> > >>>>>>> >
>>>> > > >>>>>>>>>>> > >>>>>>> > 1. Variant data types
>>>> > > >>>>>>>>>>> > >>>>>>> > Variant types allow for the efficient binary encoding of dynamic semi-structured data such as JSON, Avro, etc. By encoding semi-structured data as a variant column, we retain the flexibility of the source data while allowing query engines to operate on the data more efficiently. Snowflake has supported the variant data type on Snowflake tables for many years [1]. As more and more users utilize Iceberg tables in Snowflake, we're hearing an increasing chorus of requests for variant support. Additionally, other query engines such as Apache Spark have begun adding variant support [2]. As such, we believe it would be beneficial to the Iceberg community as a whole to standardize on the variant data type encoding used across Iceberg tables.
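
To make "efficient binary encoding" concrete: in the Spark encoding [2], a variant is carried as a pair of binary values, metadata (a dictionary of object keys) plus value, and every value announces its type in its first byte. Below is a rough, hand-rolled Java sketch of two trivial values; a real writer would use a builder, and the byte layouts reflect my reading of the Spark spec, so verify against [2] before relying on them.

    // Sketch: hand-encoding two tiny variant values per the Spark spec [2].
    // For intuition only; real engines use a builder instead of raw bytes.
    import java.nio.charset.StandardCharsets;

    public final class VariantEncodeSketch {
      // A JSON string like "iceberg" as a "short string" value: the header
      // byte packs basic type 1 (short string) with the byte length above it.
      static byte[] shortString(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8); // must be under 64 bytes
        byte[] out = new byte[1 + utf8.length];
        out[0] = (byte) ((utf8.length << 2) | 1);
        System.arraycopy(utf8, 0, out, 1, utf8.length);
        return out;
      }

      // A small JSON number as a one-byte integer: basic type 0 (primitive)
      // with the int8 primitive type id in the upper header bits, then the value.
      static byte[] int8(byte v) {
        final int INT8_TYPE_ID = 3; // per the Spark primitive type table
        return new byte[] { (byte) (INT8_TYPE_ID << 2), v };
      }
    }

The payoff is that a reader learns the runtime type from a single header byte instead of re-parsing JSON text, while arbitrary nesting remains representable.
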
>>>> > > >>>>>>>>>>> > >>>>>>> >
>>>> > > >>>>>>>>>>> > >>>>>>> > One specific point to make here is that, since an Apache OSS version of variant encoding already exists in Spark, it likely makes sense to simply adopt the Spark encoding as the Iceberg standard as well. The encoding we use internally today in Snowflake is slightly different, but essentially equivalent, and we see no particular value in trying to clutter the space with another equivalent-but-incompatible encoding.
>>>> > > >>>>>>>>>>> > >>>>>>> >
>>>> > > >>>>>>>>>>> > >>>>>>> > 2. Subcolumnarization
>>>> > > >>>>>>>>>>> > >>>>>>> > Subcolumnarization of variant columns allows query engines to efficiently prune datasets when subcolumns (i.e., nested fields) within a variant column are queried, and it also allows optionally materializing some of the nested fields as columns of their own, affording queries on these subcolumns the ability to read less data and spend less CPU on extraction. When subcolumnarizing, the system managing table metadata and data tracks individual pruning statistics (min, max, null, etc.) for some subset of the nested fields within a variant, and also manages any optional materialization. Without subcolumnarization, any query which touches a variant column must read, parse, extract, and filter every row for which that column is non-null. Thus, by providing a standardized way of tracking subcolumn metadata and data for variant columns, Iceberg can make subcolumnar optimizations accessible across various catalogs and query engines.
>>>> > > >>>>>>>>>>> > >>>>>>> >
>>>> > > >>>>>>>>>>> > >>>>>>> > Subcolumnarization is a non-trivial topic, so we expect any concrete proposal to include not only the set of changes to Iceberg metadata that allow compatible query engines to interoperate on subcolumnarization data for variant columns, but also reference documentation explaining subcolumnarization principles and recommended best practices.
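
A rough Java sketch of the pruning half of this idea: if each data file carries min/max/null statistics for selected nested paths inside a variant column, a planner can skip files whose ranges cannot satisfy a predicate. The stats shape below is invented for illustration; defining the real metadata layout is exactly what the proposal would need to cover.

    // Sketch: per-file stats for nested paths inside a variant column.
    // The ColumnStats shape is invented for illustration only.
    import java.util.Map;

    record ColumnStats(long min, long max, long nullCount) {}

    final class SubcolumnPruner {
      // e.g. subcolumnPath "user.age", with stats collected at write time
      static boolean fileMightMatch(Map<String, ColumnStats> fileStats,
                                    String subcolumnPath, long eqValue) {
        ColumnStats stats = fileStats.get(subcolumnPath);
        if (stats == null) {
          return true; // no stats tracked for this path: the file must be read
        }
        // An equality predicate can only match if the value falls in [min, max].
        return eqValue >= stats.min() && eqValue <= stats.max();
      }
    }

Without such stats, a filter on a nested field forces the read-parse-extract-filter pass over every non-null variant row described above; with them, whole files drop out of the scan.
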
>>>> > > >>>>>>>>>>> > >>>>>>> >
>>>> > > >>>>>>>>>>> > >>>>>>> > It sounds like the recent Geo proposal [3] may be a good starting point for how to approach this, so our plan is to write something up in that vein that covers the proposed spec changes, backwards compatibility, implementor burdens, etc. But we wanted to first reach out to the community to introduce ourselves and the idea, and see if there's any early feedback we should incorporate before we spend too much time on a concrete proposal.
>>>> > > >>>>>>>>>>> > >>>>>>> >
>>>> > > >>>>>>>>>>> > >>>>>>> > Thank you!
>>>> > > >>>>>>>>>>> > >>>>>>> >
>>>> > > >>>>>>>>>>> > >>>>>>> > [1] https://docs.snowflake.com/en/sql-reference/data-types-semistructured
>>>> > > >>>>>>>>>>> > >>>>>>> > [2] https://github.com/apache/spark/blob/master/common/variant/README.md
>>>> > > >>>>>>>>>>> > >>>>>>> > [3] https://docs.google.com/document/d/1iVFbrRNEzZl8tDcZC81GFt01QJkLJsI9E2NBOt21IRI/edit
>>>> > > >>>>>>>>>>> > >>>>>>> >
>>>> > > >>>>>>>>>>> > >>>>>>> > -Tyler, Nileema, Selcuk, Aihua
>>>> > > >>>>>>>>>>> >
>>>> > > >>>>>>>>>>> > --
>>>> > > >>>>>>>>>>> > Ryan Blue
>>>> > > >>>>>>>>>>> > Databricks
>>
>> --
>> Ryan Blue
>> Databricks