Hello,

We have drafted the proposal <https://docs.google.com/document/d/1QjhpG_SVNPZh3anFcpicMQx90ebwjL7rmzFYfUP89Iw/edit> for the Variant data type. Please help review and comment.
Thanks,
Aihua

On Thu, May 16, 2024 at 12:45 PM Jack Ye <yezhao...@gmail.com> wrote:

> +10000 for a JSON/BSON type. We also had the same discussion internally, and a JSON type would really play well with, for example, the SUPER type in Redshift: https://docs.aws.amazon.com/redshift/latest/dg/r_SUPER_type.html, and can also provide better integration with the Trino JSON type.
>
> Looking forward to the proposal!
>
> Best,
> Jack Ye
>
> On Wed, May 15, 2024 at 9:37 AM Tyler Akidau <tyler.aki...@snowflake.com.invalid> wrote:
>
>> On Tue, May 14, 2024 at 7:58 PM Gang Wu <ust...@gmail.com> wrote:
>>
>>> > We may need some guidance on just how many we need to look at; we were planning on Spark and Trino, but weren't sure how much further down the rabbit hole we needed to go.
>>>
>>> There are some engines living outside the Java world. It would be good if the proposal could cover the effort it takes to integrate the variant type into them (e.g. Velox, DataFusion, etc.). This is something that some proprietary Iceberg vendors also care about.
>>
>> Ack, makes sense. We can make sure to share some perspective on this.
>>
>>> > Not necessarily, no. As long as there's a binary type and Iceberg and the query engines are aware that the binary column needs to be interpreted as a variant, that should be sufficient.
>>>
>>> From the perspective of interoperability, it would be good to support a native type in the file specs. Life will be easier for projects like Apache XTable. File formats could also provide finer-grained statistics for the variant type, which facilitates data skipping.
>>
>> Agreed, there can definitely be additional value in native file format integration. Just wanted to highlight that it's not a strict requirement.
>>
>> -Tyler
>>
>>> Gang
>>>
>>> On Wed, May 15, 2024 at 6:49 AM Tyler Akidau <tyler.aki...@snowflake.com.invalid> wrote:
>>>
>>>> Good to see you again as well, JB! Thanks!
>>>>
>>>> -Tyler
>>>>
>>>> On Tue, May 14, 2024 at 1:04 PM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>>>>
>>>>> Hi Tyler,
>>>>>
>>>>> Super happy to see you there :) It reminds me of our discussions back at the start of Apache Beam :)
>>>>>
>>>>> Anyway, the thread is pretty interesting. I remember some discussions about a JSON data type for spec v3. The binary data type is already supported in spec v2.
>>>>>
>>>>> I'm looking forward to the proposal and happy to help on this!
>>>>>
>>>>> Regards
>>>>> JB
>>>>>
>>>>> On Sat, May 11, 2024 at 7:06 AM Tyler Akidau <tyler.aki...@snowflake.com.invalid> wrote:
>>>>> >
>>>>> > Hello,
>>>>> >
>>>>> > We (Tyler, Nileema, Selcuk, Aihua) are working on a proposal for which we’d like to get early feedback from the community. As you may know, Snowflake has embraced Iceberg as its open Data Lake format. Having made good progress on our own adoption of the Iceberg standard, we’re now in a position where there are features not yet supported in Iceberg which we think would be valuable for our users, and that we would like to discuss with and help contribute to the Iceberg community.
>>>>> >
>>>>> > The first two such features we’d like to discuss are in support of efficient querying of dynamically typed, semi-structured data: variant data types, and subcolumnarization of variant columns. In more detail, for anyone who may not already be familiar:
>>>>> >
>>>>> > 1. Variant data types
>>>>> > Variant types allow for the efficient binary encoding of dynamic semi-structured data such as JSON, Avro, etc. By encoding semi-structured data as a variant column, we retain the flexibility of the source data, while allowing query engines to more efficiently operate on the data. Snowflake has supported the variant data type on Snowflake tables for many years [1]. As more and more users utilize Iceberg tables in Snowflake, we’re hearing an increasing chorus of requests for variant support. Additionally, other query engines such as Apache Spark have begun adding variant support [2]. As such, we believe it would be beneficial to the Iceberg community as a whole to standardize on the variant data type encoding used across Iceberg tables.
>>>>> >
>>>>> > One specific point to make here is that, since an Apache OSS version of variant encoding already exists in Spark, it likely makes sense to simply adopt the Spark encoding as the Iceberg standard as well. The encoding we use internally today in Snowflake is slightly different, but essentially equivalent, and we see no particular value in trying to clutter the space with another equivalent-but-incompatible encoding.
>>>>> >
>>>>> > 2. Subcolumnarization
>>>>> > Subcolumnarization of variant columns allows query engines to efficiently prune datasets when subcolumns (i.e., nested fields) within a variant column are queried, and also allows optionally materializing some of the nested fields as columns of their own, affording queries on these subcolumns the ability to read less data and spend less CPU on extraction. When subcolumnarizing, the system managing table metadata and data tracks individual pruning statistics (min, max, null, etc.) for some subset of the nested fields within a variant, and also manages any optional materialization. Without subcolumnarization, any query which touches a variant column must read, parse, extract, and filter every row for which that column is non-null. Thus, by providing a standardized way of tracking subcolumn metadata and data for variant columns, Iceberg can make subcolumnar optimizations accessible across various catalogs and query engines.
>>>>> >
>>>>> > Subcolumnarization is a non-trivial topic, so we expect any concrete proposal to include not only the set of changes to Iceberg metadata that allow compatible query engines to interoperate on subcolumnarization data for variant columns, but also reference documentation explaining subcolumnarization principles and recommended best practices.
>>>>> >
>>>>> > It sounds like the recent Geo proposal [3] may be a good starting point for how to approach this, so our plan is to write something up in that vein that covers the proposed spec changes, backwards compatibility, implementor burdens, etc. But we wanted to first reach out to the community to introduce ourselves and the idea, and see if there’s any early feedback we should incorporate before we spend too much time on a concrete proposal.
>>>>> >
>>>>> > Thank you!
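>>>>> >
>>>>> > For a concrete feel of (1), here is a purely illustrative PySpark sketch, not taken from the proposal; it assumes a Spark build with variant support (e.g. Spark 4.0+) where the parse_json and variant_get functions are available:
>>>>> >
>>>>> >     # Illustrative only: assumes Spark with variant support, where parse_json()
>>>>> >     # produces a variant value and variant_get() extracts typed nested fields.
>>>>> >     from pyspark.sql import SparkSession
>>>>> >
>>>>> >     spark = SparkSession.builder.appName("variant-sketch").getOrCreate()
>>>>> >
>>>>> >     # Store free-form JSON as a binary-encoded variant rather than as a raw string.
>>>>> >     spark.sql("""
>>>>> >         SELECT parse_json('{"user": {"id": 42}, "latency_ms": 7}') AS payload
>>>>> >     """).createOrReplaceTempView("events")
>>>>> >
>>>>> >     # Engines can then extract typed nested fields directly from the encoding,
>>>>> >     # without re-parsing the original JSON text on every row.
>>>>> >     spark.sql("""
>>>>> >         SELECT variant_get(payload, '$.user.id', 'bigint') AS user_id,
>>>>> >                variant_get(payload, '$.latency_ms', 'int') AS latency_ms
>>>>> >         FROM events
>>>>> >     """).show()
>>>>> >
>>>>> > The exact function names vary by engine; the point is only that queries operate on the binary variant encoding instead of reparsing JSON text per row.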
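>>>>> >
>>>>> > To illustrate the pruning idea behind (2), here is a purely hypothetical Python sketch; none of these names come from the Iceberg spec or from the proposal, they only show how per-subcolumn min/max/null statistics could let an engine skip files:
>>>>> >
>>>>> >     # Hypothetical sketch: every name here is made up for illustration.
>>>>> >     from dataclasses import dataclass
>>>>> >     from typing import Dict, Optional
>>>>> >
>>>>> >     @dataclass
>>>>> >     class SubcolumnStats:
>>>>> >         path: str                  # nested field within the variant, e.g. "$.user.id"
>>>>> >         min_value: Optional[int]   # lower bound observed in the data file
>>>>> >         max_value: Optional[int]   # upper bound observed in the data file
>>>>> >         null_count: int            # rows where the field is missing or null
>>>>> >
>>>>> >     @dataclass
>>>>> >     class DataFile:
>>>>> >         path: str
>>>>> >         subcolumn_stats: Dict[str, SubcolumnStats]  # only for tracked subcolumns
>>>>> >
>>>>> >     def can_skip(f: DataFile, field: str, value: int) -> bool:
>>>>> >         """Skip the file for `field = value` when the tracked bounds rule it out.
>>>>> >         Without subcolumn stats, every non-null variant row must be read and parsed."""
>>>>> >         s = f.subcolumn_stats.get(field)
>>>>> >         if s is None or s.min_value is None or s.max_value is None:
>>>>> >             return False  # nothing tracked for this nested field: cannot prune
>>>>> >         return value < s.min_value or value > s.max_value
>>>>> >
>>>>> >     f = DataFile("part-00000.parquet",
>>>>> >                  {"$.user.id": SubcolumnStats("$.user.id", 1, 500, 0)})
>>>>> >     print(can_skip(f, "$.user.id", 1000))  # True: 1000 lies outside [1, 500]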
>>>>> >
>>>>> > [1] https://docs.snowflake.com/en/sql-reference/data-types-semistructured
>>>>> > [2] https://github.com/apache/spark/blob/master/common/variant/README.md
>>>>> > [3] https://docs.google.com/document/d/1iVFbrRNEzZl8tDcZC81GFt01QJkLJsI9E2NBOt21IRI/edit
>>>>> >
>>>>> > -Tyler, Nileema, Selcuk, Aihua