Hey Aihua and Tyler,

Thanks again for raising this. I reviewed the proposal and it looks good. Thanks also to everyone for jumping in and providing feedback.

Looking at the proposal and comments, I think the biggest open issue that needs to be decided is subcolumnarization vs. native type <https://docs.google.com/document/d/1QjhpG_SVNPZh3anFcpicMQx90ebwjL7rmzFYfUP89Iw/edit?disco=AAABOfRqAn8>. The Spark variant encoding also has an open PR on the subject of subcolumnarization <https://github.com/apache/spark/pull/46831/>.
To keep this moving, I would suggest that the document's authors go over the open issues and try to resolve the low-hanging fruit. That alone should clean up the proposal quite a bit. Then we can come up with a list of open questions (happy to help) and have a meeting to discuss them. WDYT?

Kind regards,
Fokko Driesprong

On Fri, May 31, 2024 at 18:54, Aihua Xu <aihua...@snowflake.com.invalid> wrote:

> Hello,
>
> We have drafted the proposal
> <https://docs.google.com/document/d/1QjhpG_SVNPZh3anFcpicMQx90ebwjL7rmzFYfUP89Iw/edit>
> for Variant data type. Please help review and comment.
>
> Thanks,
> Aihua
>
> On Thu, May 16, 2024 at 12:45 PM Jack Ye <yezhao...@gmail.com> wrote:
>
>> +10000 for a JSON/BSON type. We also had the same discussion internally
>> and a JSON type would really play well with, for example, the SUPER type in
>> Redshift:
>> https://docs.aws.amazon.com/redshift/latest/dg/r_SUPER_type.html, and
>> can also provide better integration with the Trino JSON type.
>>
>> Looking forward to the proposal!
>>
>> Best,
>> Jack Ye
>>
>>
>> On Wed, May 15, 2024 at 9:37 AM Tyler Akidau
>> <tyler.aki...@snowflake.com.invalid> wrote:
>>
>>> On Tue, May 14, 2024 at 7:58 PM Gang Wu <ust...@gmail.com> wrote:
>>>
>>>> > We may need some guidance on just how many we need to look at;
>>>> > we were planning on Spark and Trino, but weren't sure how much
>>>> > further down the rabbit hole we needed to go.
>>>>
>>>> There are some engines living outside the Java world. It would be
>>>> good if the proposal could cover the effort it takes to integrate
>>>> variant type to them (e.g. velox, datafusion, etc.). This is something
>>>> that some proprietary iceberg vendors also care about.
>>>>
>>>
>>> Ack, makes sense. We can make sure to share some perspective on this.
>>>
>>>> > Not necessarily, no. As long as there's a binary type and Iceberg and
>>>> > the query engines are aware that the binary column needs to be
>>>> > interpreted as a variant, that should be sufficient.
>>>>
>>>> From the perspective of interoperability, it would be good to support
>>>> native type from file specs. Life will be easier for projects like
>>>> Apache XTable. File format could also provide finer-grained statistics
>>>> for variant type which facilitates data skipping.
>>>>
>>>
>>> Agreed, there can definitely be additional value in native file format
>>> integration. Just wanted to highlight that it's not a strict requirement.
>>>
>>> -Tyler
>>>
>>>
>>>>
>>>> Gang
>>>>
>>>> On Wed, May 15, 2024 at 6:49 AM Tyler Akidau
>>>> <tyler.aki...@snowflake.com.invalid> wrote:
>>>>
>>>>> Good to see you again as well, JB! Thanks!
>>>>>
>>>>> -Tyler
>>>>>
>>>>>
>>>>> On Tue, May 14, 2024 at 1:04 PM Jean-Baptiste Onofré <j...@nanthrax.net>
>>>>> wrote:
>>>>>
>>>>>> Hi Tyler,
>>>>>>
>>>>>> Super happy to see you there :) It reminds me of our discussions back
>>>>>> at the start of Apache Beam :)
>>>>>>
>>>>>> Anyway, the thread is pretty interesting. I remember some discussions
>>>>>> about JSON datatype for spec v3. The binary data type is already
>>>>>> supported in the spec v2.
>>>>>>
>>>>>> I'm looking forward to the proposal and happy to help on this!
>>>>>>
>>>>>> Regards
>>>>>> JB
>>>>>>
>>>>>> On Sat, May 11, 2024 at 7:06 AM Tyler Akidau
>>>>>> <tyler.aki...@snowflake.com.invalid> wrote:
>>>>>> >
>>>>>> > Hello,
>>>>>> >
>>>>>> > We (Tyler, Nileema, Selcuk, Aihua) are working on a proposal for
>>>>>> > which we’d like to get early feedback from the community. As you may
>>>>>> > know, Snowflake has embraced Iceberg as its open Data Lake format.
>>>>>> > Having made good progress on our own adoption of the Iceberg standard,
>>>>>> > we’re now in a position where there are features not yet supported in
>>>>>> > Iceberg which we think would be valuable for our users, and that we
>>>>>> > would like to discuss with and help contribute to the Iceberg community.
>>>>>> >
>>>>>> > The first two such features we’d like to discuss are in support of
>>>>>> > efficient querying of dynamically typed, semi-structured data: variant
>>>>>> > data types, and subcolumnarization of variant columns. In more detail,
>>>>>> > for anyone who may not already be familiar:
>>>>>> >
>>>>>> > 1. Variant data types
>>>>>> > Variant types allow for the efficient binary encoding of dynamic
>>>>>> > semi-structured data such as JSON, Avro, etc. By encoding
>>>>>> > semi-structured data as a variant column, we retain the flexibility of
>>>>>> > the source data, while allowing query engines to more efficiently
>>>>>> > operate on the data. Snowflake has supported the variant data type on
>>>>>> > Snowflake tables for many years [1]. As more and more users utilize
>>>>>> > Iceberg tables in Snowflake, we’re hearing an increasing chorus of
>>>>>> > requests for variant support. Additionally, other query engines such
>>>>>> > as Apache Spark have begun adding variant support [2]. As such, we
>>>>>> > believe it would be beneficial to the Iceberg community as a whole to
>>>>>> > standardize on the variant data type encoding used across Iceberg
>>>>>> > tables.
>>>>>> >
>>>>>> > One specific point to make here is that, since an Apache OSS version
>>>>>> > of variant encoding already exists in Spark, it likely makes sense to
>>>>>> > simply adopt the Spark encoding as the Iceberg standard as well. The
>>>>>> > encoding we use internally today in Snowflake is slightly different,
>>>>>> > but essentially equivalent, and we see no particular value in trying
>>>>>> > to clutter the space with another equivalent-but-incompatible
>>>>>> > encoding.
>>>>>> >
>>>>>> >
>>>>>> > 2. Subcolumnarization
>>>>>> > Subcolumnarization of variant columns allows query engines to
>>>>>> > efficiently prune datasets when subcolumns (i.e., nested fields)
>>>>>> > within a variant column are queried, and also allows optionally
>>>>>> > materializing some of the nested fields as a column on their own,
>>>>>> > affording queries on these subcolumns the ability to read less data
>>>>>> > and spend less CPU on extraction. When subcolumnarizing, the system
>>>>>> > managing table metadata and data tracks individual pruning statistics
>>>>>> > (min, max, null, etc.) for some subset of the nested fields within a
>>>>>> > variant, and also manages any optional materialization. Without
>>>>>> > subcolumnarization, any query which touches a variant column must
>>>>>> > read, parse, extract, and filter every row for which that column is
>>>>>> > non-null. Thus, by providing a standardized way of tracking subcolumn
>>>>>> > metadata and data for variant columns, Iceberg can make subcolumnar
>>>>>> > optimizations accessible across various catalogs and query engines.
>>>>>> >
>>>>>> > Subcolumnarization is a non-trivial topic, so we expect any concrete
>>>>>> > proposal to include not only the set of changes to Iceberg metadata
>>>>>> > that allow compatible query engines to interoperate on
>>>>>> > subcolumnarization data for variant columns, but also reference
>>>>>> > documentation explaining subcolumnarization principles and
>>>>>> > recommended best practices.
>>>>>> >
>>>>>> >
>>>>>> > It sounds like the recent Geo proposal [3] may be a good starting
>>>>>> > point for how to approach this, so our plan is to write something up
>>>>>> > in that vein that covers the proposed spec changes, backwards
>>>>>> > compatibility, implementor burdens, etc. But we wanted to first reach
>>>>>> > out to the community to introduce ourselves and the idea, and see if
>>>>>> > there’s any early feedback we should incorporate before we spend too
>>>>>> > much time on a concrete proposal.
>>>>>> >
>>>>>> > Thank you!
>>>>>> >
>>>>>> > [1] https://docs.snowflake.com/en/sql-reference/data-types-semistructured
>>>>>> > [2] https://github.com/apache/spark/blob/master/common/variant/README.md
>>>>>> > [3] https://docs.google.com/document/d/1iVFbrRNEzZl8tDcZC81GFt01QJkLJsI9E2NBOt21IRI/edit
>>>>>> >
>>>>>> > -Tyler, Nileema, Selcuk, Aihua
>>>>>> >
>>>>>>
>>>>>