Thanks Ryan. Yeah. That's another reason we want to pursue Spark encoding to keep compatibility for the open source engines.
One more question regarding the encoding implementation: do we have an issue to directly use Spark implementation in Iceberg? Russell pointed out that Trino doesn't have Spark dependency and that could be a problem? Thanks, Aihua On 2024/07/12 15:02:06 Ryan Blue wrote: > Thanks, Aihua! > > I think that the encoding choice in the current doc is a good one. I went > through the Spark encoding in detail and it looks like a better choice than > the other candidate encodings for quickly accessing nested fields. > > Another reason to use the Spark type is that this is what Delta's variant > type is based on, so Parquet files in tables written by Delta could be > converted or used in Iceberg tables without needing to rewrite variant > data. (Also, note that I work at Databricks and have an interest in > increasing format compatibility.) > > Ryan > > On Thu, Jul 11, 2024 at 11:21 AM Aihua Xu <aihua...@snowflake.com.invalid> > wrote: > > > [Discuss] Consensus for Variant Encoding > > > > It’s great to be able to present the Variant type proposal in the > > community sync yesterday and I’m looking to host a meeting next week > > (targeting for 9am, July 17th) to go over any further concerns about the > > encoding of the Variant type and any other questions on the first phase of > > the proposal > > <https://docs.google.com/document/d/1QjhpG_SVNPZh3anFcpicMQx90ebwjL7rmzFYfUP89Iw/edit>. > > We are hoping that anyone who is interested in the proposal can either join > > or reply with their comments so we can discuss them. Summary of the > > discussion and notes will be sent to the mailing list for further comment > > there. > > > > > > - > > > > What should be the underlying binary representation > > > > We have evaluated a few encodings in the doc including ION, JSONB, and > > Spark encoding.Choosing the underlying encoding is an important first step > > here and we believe we have general support for Spark’s Variant encoding. > > We would like to hear if anyone else has strong opinions in this space. > > > > > > - > > > > Should we support multiple logical types or just Variant? Variant vs. > > Variant + JSON. > > > > This is to discuss what logical data type(s) to be supported in Iceberg - > > Variant only vs. Variant + JSON. Both types would share the same underlying > > encoding but would imply different limitations on engines working with > > those types. > > > > From the sync up meeting, we are more favoring toward supporting Variant > > only and we want to have a consensus on the supported type(s). > > > > > > - > > > > How should we move forward with Subcolumnization? > > > > Subcolumnization is an optimization for Variant type by separating out > > subcolumns with their own metadata. This is not critical for choosing the > > initial encoding of the Variant type so we were hoping to gain consensus on > > leaving that for a follow up spec. > > > > > > Thanks > > > > Aihua > > > > Meeting invite: > > > > Wednesday, July 17 · 9:00 – 10:00am > > Time zone: America/Los_Angeles > > Google Meet joining info > > Video call link: https://meet.google.com/pbm-ovzn-aoq > > Or dial: (US) +1 650-449-9343 PIN: 170 576 525# > > More phone numbers: https://tel.meet/pbm-ovzn-aoq?pin=4079632691790 > > > > On Tue, May 28, 2024 at 9:21 PM Aihua Xu <aihua...@snowflake.com> wrote: > > > >> Hello, > >> > >> We have drafted the proposal > >> <https://docs.google.com/document/d/1QjhpG_SVNPZh3anFcpicMQx90ebwjL7rmzFYfUP89Iw/edit> > >> for Variant data type. Please help review and comment. > >> > >> Thanks, > >> Aihua > >> > >> On Thu, May 16, 2024 at 12:45 PM Jack Ye <yezhao...@gmail.com> wrote: > >> > >>> +10000 for a JSON/BSON type. We also had the same discussion internally > >>> and a JSON type would really play well with for example the SUPER type in > >>> Redshift: > >>> https://docs.aws.amazon.com/redshift/latest/dg/r_SUPER_type.html, and > >>> can also provide better integration with the Trino JSON type. > >>> > >>> Looking forward to the proposal! > >>> > >>> Best, > >>> Jack Ye > >>> > >>> > >>> On Wed, May 15, 2024 at 9:37 AM Tyler Akidau > >>> <tyler.aki...@snowflake.com.invalid> wrote: > >>> > >>>> On Tue, May 14, 2024 at 7:58 PM Gang Wu <ust...@gmail.com> wrote: > >>>> > >>>>> > We may need some guidance on just how many we need to look at; > >>>>> > we were planning on Spark and Trino, but weren't sure how much > >>>>> > further down the rabbit hole we needed to go。 > >>>>> > >>>>> There are some engines living outside the Java world. It would be > >>>>> good if the proposal could cover the effort it takes to integrate > >>>>> variant type to them (e.g. velox, datafusion, etc.). This is something > >>>>> that > >>>>> some proprietary iceberg vendors also care about. > >>>>> > >>>> > >>>> Ack, makes sense. We can make sure to share some perspective on this. > >>>> > >>>> > Not necessarily, no. As long as there's a binary type and Iceberg and > >>>>> > the query engines are aware that the binary column needs to be > >>>>> > interpreted as a variant, that should be sufficient. > >>>>> > >>>>> From the perspective of interoperability, it would be good to support > >>>>> native > >>>>> type from file specs. Life will be easier for projects like Apache > >>>>> XTable. > >>>>> File format could also provide finer-grained statistics for variant > >>>>> type which > >>>>> facilitates data skipping. > >>>>> > >>>> > >>>> Agreed, there can definitely be additional value in native file format > >>>> integration. Just wanted to highlight that it's not a strict requirement. > >>>> > >>>> -Tyler > >>>> > >>>> > >>>>> > >>>>> Gang > >>>>> > >>>>> On Wed, May 15, 2024 at 6:49 AM Tyler Akidau > >>>>> <tyler.aki...@snowflake.com.invalid> wrote: > >>>>> > >>>>>> Good to see you again as well, JB! Thanks! > >>>>>> > >>>>>> -Tyler > >>>>>> > >>>>>> > >>>>>> On Tue, May 14, 2024 at 1:04 PM Jean-Baptiste Onofré > >>>>>> <j...@nanthrax.net> > >>>>>> wrote: > >>>>>> > >>>>>>> Hi Tyler, > >>>>>>> > >>>>>>> Super happy to see you there :) It reminds me our discussions back in > >>>>>>> the start of Apache Beam :) > >>>>>>> > >>>>>>> Anyway, the thread is pretty interesting. I remember some discussions > >>>>>>> about JSON datatype for spec v3. The binary data type is already > >>>>>>> supported in the spec v2. > >>>>>>> > >>>>>>> I'm looking forward to the proposal and happy to help on this ! > >>>>>>> > >>>>>>> Regards > >>>>>>> JB > >>>>>>> > >>>>>>> On Sat, May 11, 2024 at 7:06 AM Tyler Akidau > >>>>>>> <tyler.aki...@snowflake.com.invalid> wrote: > >>>>>>> > > >>>>>>> > Hello, > >>>>>>> > > >>>>>>> > We (Tyler, Nileema, Selcuk, Aihua) are working on a proposal for > >>>>>>> which we’d like to get early feedback from the community. As you may > >>>>>>> know, > >>>>>>> Snowflake has embraced Iceberg as its open Data Lake format. Having > >>>>>>> made > >>>>>>> good progress on our own adoption of the Iceberg standard, we’re now > >>>>>>> in a > >>>>>>> position where there are features not yet supported in Iceberg which > >>>>>>> we > >>>>>>> think would be valuable for our users, and that we would like to > >>>>>>> discuss > >>>>>>> with and help contribute to the Iceberg community. > >>>>>>> > > >>>>>>> > The first two such features we’d like to discuss are in support of > >>>>>>> efficient querying of dynamically typed, semi-structured data: > >>>>>>> variant data > >>>>>>> types, and subcolumnarization of variant columns. In more detail, for > >>>>>>> anyone who may not already be familiar: > >>>>>>> > > >>>>>>> > 1. Variant data types > >>>>>>> > Variant types allow for the efficient binary encoding of dynamic > >>>>>>> semi-structured data such as JSON, Avro, etc. By encoding > >>>>>>> semi-structured > >>>>>>> data as a variant column, we retain the flexibility of the source > >>>>>>> data, > >>>>>>> while allowing query engines to more efficiently operate on the data. > >>>>>>> Snowflake has supported the variant data type on Snowflake tables for > >>>>>>> many > >>>>>>> years [1]. As more and more users utilize Iceberg tables in Snowflake, > >>>>>>> we’re hearing an increasing chorus of requests for variant support. > >>>>>>> Additionally, other query engines such as Apache Spark have begun > >>>>>>> adding > >>>>>>> variant support [2]. As such, we believe it would be beneficial to the > >>>>>>> Iceberg community as a whole to standardize on the variant data type > >>>>>>> encoding used across Iceberg tables. > >>>>>>> > > >>>>>>> > One specific point to make here is that, since an Apache OSS > >>>>>>> version of variant encoding already exists in Spark, it likely makes > >>>>>>> sense > >>>>>>> to simply adopt the Spark encoding as the Iceberg standard as well. > >>>>>>> The > >>>>>>> encoding we use internally today in Snowflake is slightly different, > >>>>>>> but > >>>>>>> essentially equivalent, and we see no particular value in trying to > >>>>>>> clutter > >>>>>>> the space with another equivalent-but-incompatible encoding. > >>>>>>> > > >>>>>>> > > >>>>>>> > 2. Subcolumnarization > >>>>>>> > Subcolumnarization of variant columns allows query engines to > >>>>>>> efficiently prune datasets when subcolumns (i.e., nested fields) > >>>>>>> within a > >>>>>>> variant column are queried, and also allows optionally materializing > >>>>>>> some > >>>>>>> of the nested fields as a column on their own, affording queries on > >>>>>>> these > >>>>>>> subcolumns the ability to read less data and spend less CPU on > >>>>>>> extraction. > >>>>>>> When subcolumnarizing, the system managing table metadata and data > >>>>>>> tracks > >>>>>>> individual pruning statistics (min, max, null, etc.) for some subset > >>>>>>> of the > >>>>>>> nested fields within a variant, and also manages any optional > >>>>>>> materialization. Without subcolumnarization, any query which touches a > >>>>>>> variant column must read, parse, extract, and filter every row for > >>>>>>> which > >>>>>>> that column is non-null. Thus, by providing a standardized way of > >>>>>>> tracking > >>>>>>> subcolum metadata and data for variant columns, Iceberg can make > >>>>>>> subcolumnar optimizations accessible across various catalogs and query > >>>>>>> engines. > >>>>>>> > > >>>>>>> > Subcolumnarization is a non-trivial topic, so we expect any > >>>>>>> concrete proposal to include not only the set of changes to Iceberg > >>>>>>> metadata that allow compatible query engines to interopate on > >>>>>>> subcolumnarization data for variant columns, but also reference > >>>>>>> documentation explaining subcolumnarization principles and > >>>>>>> recommended best > >>>>>>> practices. > >>>>>>> > > >>>>>>> > > >>>>>>> > It sounds like the recent Geo proposal [3] may be a good starting > >>>>>>> point for how to approach this, so our plan is to write something up > >>>>>>> in > >>>>>>> that vein that covers the proposed spec changes, backwards > >>>>>>> compatibility, > >>>>>>> implementor burdens, etc. But we wanted to first reach out to the > >>>>>>> community > >>>>>>> to introduce ourselves and the idea, and see if there’s any early > >>>>>>> feedback > >>>>>>> we should incorporate before we spend too much time on a concrete > >>>>>>> proposal. > >>>>>>> > > >>>>>>> > Thank you! > >>>>>>> > > >>>>>>> > [1] > >>>>>>> https://docs.snowflake.com/en/sql-reference/data-types-semistructured > >>>>>>> > [2] > >>>>>>> https://github.com/apache/spark/blob/master/common/variant/README.md > >>>>>>> > [3] > >>>>>>> https://docs.google.com/document/d/1iVFbrRNEzZl8tDcZC81GFt01QJkLJsI9E2NBOt21IRI/edit > >>>>>>> > > >>>>>>> > -Tyler, Nileema, Selcuk, Aihua > >>>>>>> > > >>>>>>> > >>>>>> > > -- > Ryan Blue > Databricks >