I'm aligned with point 1. For point 2, I think we should choose quickly. I honestly think this would be fine as part of the Iceberg spec directly, but I understand it may be better for the broader community if it were a sub-project. If it does become a sub-project, I would still prefer an Iceberg sub-project, since we are engine- and file-format-agnostic.
3. I support adding just Variant.

On Thu, Jul 18, 2024 at 12:54 AM Aihua Xu <aihu...@apache.org> wrote:
> Hello community,
>
> It's great to sync up with some of you on Variant and Subcolumnarization support in Iceberg again. Apologies that I didn't record the meeting, but here are the key items we want to follow up on with the community.
>
> 1. Adopt the Spark Variant encoding
> Those present were in favor of adopting the Spark Variant encoding for Iceberg Variant, with extensions to support other Iceberg types. We would like to know if anyone objects to reusing this open source encoding.
>
> 2. Movement of the Spark Variant spec to another project
> To avoid introducing Apache Spark as a dependency for engines and file formats, we discussed separating the Spark Variant encoding spec and implementation from the Spark project into a neutral location. We thought up several solutions but didn't reach consensus on any of them. We are looking for more feedback on this topic from the community, either in support of one of these options or with another idea for how to host the spec.
>
> Options proposed:
> * Leave the spec in Spark (difficult for versioning and for other engines)
> * Copy the spec into the Iceberg project directly (difficult for other table formats)
> * Create a sub-project of Apache Iceberg and move the spec and reference implementation there (logistically complicated)
> * Create a sub-project of Apache Spark and move the spec and reference implementation there (logistically complicated)
>
> 3. Add a Variant type vs. Variant and JSON types
> Those present were in favor of adding only the Variant type to Iceberg. We are looking for anyone who objects to going forward with just the Variant type and no Iceberg JSON type. We favored adding only the Variant type because:
> * Introducing a JSON type would require engines that only support Variant to do write-time validation of their input to a JSON column. An engine without a JSON type couldn't support this.
> * Engines which don't support Variant will work most of the time, and can rely on fallback strings defined in the spec for reading unsupported types. Writing JSON into a Variant will always work.
>
> 4. Support for the Subcolumnarization spec (shredding in Spark)
> We have no action items on this but would like to follow up with discussions on Subcolumnarization in the future.
> * We had general agreement that this should be included in Iceberg V3, or else adding Variant may not be useful.
> * We are also interested in adopting the shredding spec from Spark and would like to move it to wherever we decide the Variant spec is going to live.
>
> Let us know if we missed anything and if you have any additional thoughts or suggestions.
>
> Thanks,
> Aihua
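On the read-side fallback mentioned under point 3 above, here is a minimal sketch of the behavior; the enum, the set of natively supported types, and the helper names are all hypothetical illustrations, not the Iceberg or Spark API:

    import java.util.Set;

    public class VariantFallbackSketch {
      enum VariantKind { BOOLEAN, INT64, DOUBLE, STRING, TIMESTAMP }

      // Types this imaginary engine can surface natively.
      static final Set<VariantKind> NATIVE =
          Set.of(VariantKind.BOOLEAN, VariantKind.INT64, VariantKind.STRING);

      // Return the value unchanged when the engine supports the type;
      // otherwise fall back to a string rendering, so reads never fail.
      static Object read(VariantKind kind, Object value) {
        return NATIVE.contains(kind) ? value : String.valueOf(value);
      }

      public static void main(String[] args) {
        System.out.println(read(VariantKind.INT64, 42L));                        // native: 42
        System.out.println(read(VariantKind.TIMESTAMP, "2024-07-17T09:00:00Z")); // fallback string
      }
    }

The write side needs no such fallback: any JSON value can be encoded into a Variant, which is why writing JSON into a Variant always works.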
> On 2024/07/15 18:32:22 Aihua Xu wrote:
>> Thanks for the discussion.
>>
>> I will move forward with the spec PR.
>>
>> Regarding the implementation, we will have a module for Variant support in Iceberg, so we will not have to bring in Spark libraries.
>>
>> I'm reposting the meeting invite in case it wasn't clear in my original email, since I included it at the end. It looks like we don't have major objections/divergences, but let's sync up and reach consensus.
>>
>> Meeting invite:
>>
>> Wednesday, July 17 · 9:00 – 10:00am
>> Time zone: America/Los_Angeles
>> Google Meet joining info
>> Video call link: https://meet.google.com/pbm-ovzn-aoq
>> Or dial: (US) +1 650-449-9343 PIN: 170 576 525#
>> More phone numbers: https://tel.meet/pbm-ovzn-aoq?pin=4079632691790
>>
>> Thanks,
>> Aihua
>>
>> On 2024/07/12 20:55:01 Micah Kornfield wrote:
>>> I don't think this needs to hold up the PR, but I think coming to a consensus on the exact set of supported types is worthwhile (and on whether the goal is to maintain the same set as specified by the Spark Variant type, or whether divergence is expected/allowed). From a fragmentation perspective it would be a shame if they diverge, so maybe a next step is also suggesting support to the Spark community for the missing existing Iceberg types?
>>>
>>> Thanks,
>>> Micah
>>>
>>> On Fri, Jul 12, 2024 at 1:44 PM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>>> Just talked with Aihua and he's working on the spec PR now. We can get feedback there from everyone.
>>>>
>>>> On Fri, Jul 12, 2024 at 3:41 PM Ryan Blue <b...@databricks.com.invalid> wrote:
>>>>> Good idea, but I'm hoping that we can continue to get their feedback in parallel to getting the spec changes started. Piotr didn't seem to object to the encoding from what I read of his comments. Hopefully he (and others) chime in here.
>>>>>
>>>>> On Fri, Jul 12, 2024 at 1:32 PM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>>>>> I just want to make sure we get Piotr and Peter on board as representatives of the Flink and Trino engines. Also make sure we have anyone else chime in who has experience with Ray, if possible.
>>>>>>
>>>>>> Spec changes feel like the right next step.
>>>>>>
>>>>>> On Fri, Jul 12, 2024 at 3:14 PM Ryan Blue <b...@databricks.com.invalid> wrote:
>>>>>>> Okay, what are the next steps here? This proposal has been out for quite a while and I don't see any major objections to using the Spark encoding. It's quite well designed and fits the need well. It can also be extended to support additional types that are missing, if that's a priority.
>>>>>>>
>>>>>>> Should we move forward by starting a draft of the changes to the table spec? Then we can vote on committing those changes and get moving on an implementation (or possibly do the implementation in parallel).
>>>>>>>
>>>>>>> On Fri, Jul 12, 2024 at 1:08 PM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>>>>>>> That's fair, I'm sold on an Iceberg module.
>>>>>>>>
>>>>>>>> On Fri, Jul 12, 2024 at 2:53 PM Ryan Blue <b...@databricks.com.invalid> wrote:
>>>>>>>>> > Feels like eventually the encoding should land in parquet proper right?
>>>>>>>>>
>>>>>>>>> What about using it in ORC? I don't know where it should end up. Maybe Iceberg should make a standalone module from it?
>>>>>>>>>
>>>>>>>>> On Fri, Jul 12, 2024 at 12:38 PM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>>>>>>>>> Feels like eventually the encoding should land in Parquet proper, right? I'm fine with us just copying it into Iceberg for the time being.
>>>>>>>>>> On Fri, Jul 12, 2024 at 2:31 PM Ryan Blue <b...@databricks.com.invalid> wrote:
>>>>>>>>>>> Oops, it looks like I missed where Aihua brought this up in his last email:
>>>>>>>>>>>
>>>>>>>>>>> > do we have an issue to directly use Spark implementation in Iceberg?
>>>>>>>>>>>
>>>>>>>>>>> Yes, I think that we do have an issue using the Spark library. What do you think about a Java implementation in Iceberg?
>>>>>>>>>>>
>>>>>>>>>>> Ryan
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Jul 12, 2024 at 12:28 PM Ryan Blue <b...@databricks.com> wrote:
>>>>>>>>>>>> I raised the same point from Peter's email in a comment on the doc as well. There is a spark-variant_2.13 artifact that would be a much smaller scope than relying on large portions of Spark, but even then I doubt that it is a good idea for Iceberg to depend on it, because it is a Scala artifact and we would need to bring in a ton of Scala libs. I think what makes the most sense is to have an independent implementation of the spec in Iceberg.
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Jul 12, 2024 at 11:51 AM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>>>>>>>>>>>> Hi Aihua,
>>>>>>>>>>>>> Long time no see :)
>>>>>>>>>>>>> Would this mean that every engine which plans to support the Variant data type needs to add Spark as a dependency? Like Flink/Trino/Hive, etc.?
>>>>>>>>>>>>> Thanks, Peter
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Jul 12, 2024, 19:10 Aihua Xu <aihu...@apache.org> wrote:
>>>>>>>>>>>>>> Thanks Ryan.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Yeah. That's another reason we want to pursue the Spark encoding: to keep compatibility across the open source engines.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> One more question regarding the encoding implementation: do we have an issue with directly using the Spark implementation in Iceberg? Russell pointed out that Trino doesn't have a Spark dependency, and that could be a problem?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Aihua
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 2024/07/12 15:02:06 Ryan Blue wrote:
>>>>>>>>>>>>>>> Thanks, Aihua!
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I think that the encoding choice in the current doc is a good one. I went through the Spark encoding in detail and it looks like a better choice than the other candidate encodings for quickly accessing nested fields.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Another reason to use the Spark type is that this is what Delta's variant type is based on, so Parquet files in tables written by Delta could be converted or used in Iceberg tables without needing to rewrite variant data. (Also, note that I work at Databricks and have an interest in increasing format compatibility.)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Ryan
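As a gloss on the nested-field access point above: as I read the Spark variant README, an object's field table is kept sorted by field name, so a reader can locate a single field with a binary search instead of parsing the whole value. A minimal sketch of that lookup idea, with stand-in arrays rather than the real binary layout (all names are mine):

    import java.util.Arrays;

    public class FieldLookupSketch {
      // Field keys sorted lexicographically, with a parallel offset table
      // pointing at each field's encoded value.
      static int valueOffsetFor(String[] sortedKeys, int[] valueOffsets, String key) {
        int idx = Arrays.binarySearch(sortedKeys, key); // O(log n) probes
        return idx >= 0 ? valueOffsets[idx] : -1;       // -1: field absent
      }

      public static void main(String[] args) {
        String[] keys = {"city", "id", "zip"};
        int[] offsets = {40, 12, 57};
        System.out.println(valueOffsetFor(keys, offsets, "id")); // 12
      }
    }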
>>>>>>>>>>>>>>> On Thu, Jul 11, 2024 at 11:21 AM Aihua Xu <aihua...@snowflake.com.invalid> wrote:
>>>>>>>>>>>>>>>> [Discuss] Consensus for Variant Encoding
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> It's great to be able to present the Variant type proposal in the community sync yesterday, and I'm looking to host a meeting next week (targeting 9am, July 17th) to go over any further concerns about the encoding of the Variant type and any other questions on the first phase of the proposal <https://docs.google.com/document/d/1QjhpG_SVNPZh3anFcpicMQx90ebwjL7rmzFYfUP89Iw/edit>. We are hoping that anyone who is interested in the proposal can either join or reply with their comments so we can discuss them. A summary of the discussion and notes will be sent to the mailing list for further comment there.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> - What should be the underlying binary representation
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> We have evaluated a few encodings in the doc, including ION, JSONB, and the Spark encoding. Choosing the underlying encoding is an important first step here, and we believe we have general support for Spark's Variant encoding. We would like to hear if anyone else has strong opinions in this space.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> - Should we support multiple logical types or just Variant? Variant vs. Variant + JSON.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This is to discuss which logical data type(s) should be supported in Iceberg: Variant only vs. Variant + JSON. Both types would share the same underlying encoding but would imply different limitations for engines working with those types.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> From the sync-up meeting, we are leaning toward supporting Variant only, and we want to reach consensus on the supported type(s).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> - How should we move forward with Subcolumnarization?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Subcolumnarization is an optimization for the Variant type that separates out subcolumns with their own metadata. This is not critical for choosing the initial encoding of the Variant type, so we were hoping to gain consensus on leaving that for a follow-up spec.
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>> Aihua
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Meeting invite:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Wednesday, July 17 · 9:00 – 10:00am
>>>>>>>>>>>>>>>> Time zone: America/Los_Angeles
>>>>>>>>>>>>>>>> Google Meet joining info
>>>>>>>>>>>>>>>> Video call link: https://meet.google.com/pbm-ovzn-aoq
>>>>>>>>>>>>>>>> Or dial: (US) +1 650-449-9343 PIN: 170 576 525#
>>>>>>>>>>>>>>>> More phone numbers: https://tel.meet/pbm-ovzn-aoq?pin=4079632691790
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Tue, May 28, 2024 at 9:21 PM Aihua Xu <aihua...@snowflake.com> wrote:
>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> We have drafted the proposal <https://docs.google.com/document/d/1QjhpG_SVNPZh3anFcpicMQx90ebwjL7rmzFYfUP89Iw/edit> for the Variant data type. Please help review and comment.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>> Aihua
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Thu, May 16, 2024 at 12:45 PM Jack Ye <yezhao...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>> +10000 for a JSON/BSON type. We also had the same discussion internally, and a JSON type would really play well with, for example, the SUPER type in Redshift (https://docs.aws.amazon.com/redshift/latest/dg/r_SUPER_type.html), and can also provide better integration with the Trino JSON type.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Looking forward to the proposal!
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>> Jack Ye
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Wed, May 15, 2024 at 9:37 AM Tyler Akidau <tyler.aki...@snowflake.com.invalid> wrote:
>>>>>>>>>>>>>>>>>>> On Tue, May 14, 2024 at 7:58 PM Gang Wu <ust...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>> > We may need some guidance on just how many we need to look at;
>>>>>>>>>>>>>>>>>>>> > we were planning on Spark and Trino, but weren't sure how much
>>>>>>>>>>>>>>>>>>>> > further down the rabbit hole we needed to go.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> There are some engines living outside the Java world. It would be good if the proposal could cover the effort it takes to integrate the variant type into them (e.g. Velox, DataFusion, etc.). This is something that some proprietary Iceberg vendors also care about.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Ack, makes sense. We can make sure to share some perspective on this.
>>>>>>>>>>>>>>>>>>>> > Not necessarily, no. As long as there's a binary type and Iceberg and the query engines are aware that the binary column needs to be interpreted as a variant, that should be sufficient.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> From the perspective of interoperability, it would be good to support the native type in file specs. Life will be easier for projects like Apache XTable. The file format could also provide finer-grained statistics for the variant type, which facilitates data skipping.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Agreed, there can definitely be additional value in native file format integration. Just wanted to highlight that it's not a strict requirement.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> -Tyler
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Gang
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Wed, May 15, 2024 at 6:49 AM Tyler Akidau <tyler.aki...@snowflake.com.invalid> wrote:
>>>>>>>>>>>>>>>>>>>>> Good to see you again as well, JB! Thanks!
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> -Tyler
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Tue, May 14, 2024 at 1:04 PM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>>>>>>>>>>>>>>>>>>>>>> Hi Tyler,
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Super happy to see you there :) It reminds me of our discussions back at the start of Apache Beam :)
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Anyway, the thread is pretty interesting. I remember some discussions about a JSON datatype for spec v3. The binary data type is already supported in spec v2.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> I'm looking forward to the proposal and happy to help on this!
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>> JB
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On Sat, May 11, 2024 at 7:06 AM Tyler Akidau <tyler.aki...@snowflake.com.invalid> wrote:
>>>>>>>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> We (Tyler, Nileema, Selcuk, Aihua) are working on a proposal for which we'd like to get early feedback from the community. As you may know, Snowflake has embraced Iceberg as its open Data Lake format. Having made good progress on our own adoption of the Iceberg standard, we're now in a position where there are features not yet supported in Iceberg which we think would be valuable for our users, and which we would like to discuss with, and help contribute to, the Iceberg community.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> The first two such features we'd like to discuss are in support of efficient querying of dynamically typed, semi-structured data: variant data types, and subcolumnarization of variant columns. In more detail, for anyone who may not already be familiar:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> 1. Variant data types
>>>>>>>>>>>>>>>>>>>>>>> Variant types allow for the efficient binary encoding of dynamic semi-structured data such as JSON, Avro, etc. By encoding semi-structured data as a variant column, we retain the flexibility of the source data while allowing query engines to operate on the data more efficiently. Snowflake has supported the variant data type on Snowflake tables for many years [1]. As more and more users utilize Iceberg tables in Snowflake, we're hearing an increasing chorus of requests for variant support. Additionally, other query engines such as Apache Spark have begun adding variant support [2]. As such, we believe it would be beneficial to the Iceberg community as a whole to standardize on the variant data type encoding used across Iceberg tables.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> One specific point to make here is that, since an Apache OSS version of the variant encoding already exists in Spark, it likely makes sense to simply adopt the Spark encoding as the Iceberg standard as well. The encoding we use internally today in Snowflake is slightly different, but essentially equivalent, and we see no particular value in trying to clutter the space with another equivalent-but-incompatible encoding.
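As a concrete illustration of the encoding under discussion: a variant value travels as two binary buffers, a metadata buffer carrying a version header and a field-name dictionary, plus a value buffer whose first byte tags the type. The sketch below encodes a short string, following my reading of the Spark variant README [2]; the byte-level constants are illustrative, not normative:

    import java.nio.charset.StandardCharsets;

    public class VariantEncodingSketch {
      // Value buffer for a "short string": one header byte whose low 2 bits
      // hold the basic type (1 = short string, per my reading of the README)
      // and whose high 6 bits carry the length, followed by the UTF-8 bytes.
      static byte[] shortStringValue(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        if (utf8.length > 63) {
          throw new IllegalArgumentException("longer strings use a different form");
        }
        byte[] value = new byte[1 + utf8.length];
        value[0] = (byte) ((utf8.length << 2) | 1);
        System.arraycopy(utf8, 0, value, 1, utf8.length);
        return value;
      }

      // Minimal metadata buffer: a version byte plus an empty field-name
      // dictionary. Object values would reference dictionary entries by index.
      static byte[] emptyMetadata() {
        return new byte[] {0x01, 0x00, 0x00}; // illustrative, not normative
      }

      public static void main(String[] args) {
        System.out.printf("metadata=%d bytes, value=%d bytes%n",
            emptyMetadata().length, shortStringValue("iceberg").length);
      }
    }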
>>>>>>>>>>>>>>>>>>>>>>> 2. Subcolumnarization
>>>>>>>>>>>>>>>>>>>>>>> Subcolumnarization of variant columns allows query engines to efficiently prune datasets when subcolumns (i.e., nested fields) within a variant column are queried, and also allows optionally materializing some of the nested fields as columns of their own, affording queries on these subcolumns the ability to read less data and spend less CPU on extraction. When subcolumnarizing, the system managing table metadata and data tracks individual pruning statistics (min, max, null, etc.) for some subset of the nested fields within a variant, and also manages any optional materialization. Without subcolumnarization, any query which touches a variant column must read, parse, extract, and filter every row for which that column is non-null. Thus, by providing a standardized way of tracking subcolumn metadata and data for variant columns, Iceberg can make subcolumnar optimizations accessible across various catalogs and query engines.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Subcolumnarization is a non-trivial topic, so we expect any concrete proposal to include not only the set of changes to Iceberg metadata that allow compatible query engines to interoperate on subcolumnarization data for variant columns, but also reference documentation explaining subcolumnarization principles and recommended best practices.
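A small sketch of the pruning idea described above, with entirely hypothetical names (no stats layout has been specified yet): if a file tracks min/max for a nested path inside a variant column, a reader can skip the file whenever a predicate on that path cannot match:

    import java.util.Map;

    public class SubcolumnPruningSketch {
      record Range(long min, long max) {}

      // Per-file stats for selected subcolumn paths, e.g. "$.event.ts".
      static boolean mightContain(Map<String, Range> fileStats, String path, long equals) {
        Range r = fileStats.get(path);
        if (r == null) return true; // no stats for this path: must read the file
        return equals >= r.min && equals <= r.max;
      }

      public static void main(String[] args) {
        Map<String, Range> stats = Map.of("$.event.ts", new Range(1_000L, 2_000L));
        System.out.println(mightContain(stats, "$.event.ts", 5_000L)); // false: skip file
        System.out.println(mightContain(stats, "$.event.ts", 1_500L)); // true: read file
      }
    }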
>>>>>>>>>>>>>>>>>>>>>>> It sounds like the recent Geo proposal [3] may be a good starting point for how to approach this, so our plan is to write something up in that vein that covers the proposed spec changes, backwards compatibility, implementor burdens, etc. But we wanted to first reach out to the community to introduce ourselves and the idea, and see if there's any early feedback we should incorporate before we spend too much time on a concrete proposal.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Thank you!
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> [1] https://docs.snowflake.com/en/sql-reference/data-types-semistructured
>>>>>>>>>>>>>>>>>>>>>>> [2] https://github.com/apache/spark/blob/master/common/variant/README.md
>>>>>>>>>>>>>>>>>>>>>>> [3] https://docs.google.com/document/d/1iVFbrRNEzZl8tDcZC81GFt01QJkLJsI9E2NBOt21IRI/edit
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> -Tyler, Nileema, Selcuk, Aihua