Good to see you again as well, JB! Thanks! -Tyler
On Tue, May 14, 2024 at 1:04 PM Jean-Baptiste Onofré <j...@nanthrax.net> wrote: > Hi Tyler, > > Super happy to see you there :) It reminds me our discussions back in > the start of Apache Beam :) > > Anyway, the thread is pretty interesting. I remember some discussions > about JSON datatype for spec v3. The binary data type is already > supported in the spec v2. > > I'm looking forward to the proposal and happy to help on this ! > > Regards > JB > > On Sat, May 11, 2024 at 7:06 AM Tyler Akidau > <tyler.aki...@snowflake.com.invalid> wrote: > > > > Hello, > > > > We (Tyler, Nileema, Selcuk, Aihua) are working on a proposal for which > we’d like to get early feedback from the community. As you may know, > Snowflake has embraced Iceberg as its open Data Lake format. Having made > good progress on our own adoption of the Iceberg standard, we’re now in a > position where there are features not yet supported in Iceberg which we > think would be valuable for our users, and that we would like to discuss > with and help contribute to the Iceberg community. > > > > The first two such features we’d like to discuss are in support of > efficient querying of dynamically typed, semi-structured data: variant data > types, and subcolumnarization of variant columns. In more detail, for > anyone who may not already be familiar: > > > > 1. Variant data types > > Variant types allow for the efficient binary encoding of dynamic > semi-structured data such as JSON, Avro, etc. By encoding semi-structured > data as a variant column, we retain the flexibility of the source data, > while allowing query engines to more efficiently operate on the data. > Snowflake has supported the variant data type on Snowflake tables for many > years [1]. As more and more users utilize Iceberg tables in Snowflake, > we’re hearing an increasing chorus of requests for variant support. > Additionally, other query engines such as Apache Spark have begun adding > variant support [2]. As such, we believe it would be beneficial to the > Iceberg community as a whole to standardize on the variant data type > encoding used across Iceberg tables. > > > > One specific point to make here is that, since an Apache OSS version of > variant encoding already exists in Spark, it likely makes sense to simply > adopt the Spark encoding as the Iceberg standard as well. The encoding we > use internally today in Snowflake is slightly different, but essentially > equivalent, and we see no particular value in trying to clutter the space > with another equivalent-but-incompatible encoding. > > > > > > 2. Subcolumnarization > > Subcolumnarization of variant columns allows query engines to > efficiently prune datasets when subcolumns (i.e., nested fields) within a > variant column are queried, and also allows optionally materializing some > of the nested fields as a column on their own, affording queries on these > subcolumns the ability to read less data and spend less CPU on extraction. > When subcolumnarizing, the system managing table metadata and data tracks > individual pruning statistics (min, max, null, etc.) for some subset of the > nested fields within a variant, and also manages any optional > materialization. Without subcolumnarization, any query which touches a > variant column must read, parse, extract, and filter every row for which > that column is non-null. Thus, by providing a standardized way of tracking > subcolum metadata and data for variant columns, Iceberg can make > subcolumnar optimizations accessible across various catalogs and query > engines. > > > > Subcolumnarization is a non-trivial topic, so we expect any concrete > proposal to include not only the set of changes to Iceberg metadata that > allow compatible query engines to interopate on subcolumnarization data for > variant columns, but also reference documentation explaining > subcolumnarization principles and recommended best practices. > > > > > > It sounds like the recent Geo proposal [3] may be a good starting point > for how to approach this, so our plan is to write something up in that vein > that covers the proposed spec changes, backwards compatibility, implementor > burdens, etc. But we wanted to first reach out to the community to > introduce ourselves and the idea, and see if there’s any early feedback we > should incorporate before we spend too much time on a concrete proposal. > > > > Thank you! > > > > [1] > https://docs.snowflake.com/en/sql-reference/data-types-semistructured > > [2] https://github.com/apache/spark/blob/master/common/variant/README.md > > [3] > https://docs.google.com/document/d/1iVFbrRNEzZl8tDcZC81GFt01QJkLJsI9E2NBOt21IRI/edit > > > > -Tyler, Nileema, Selcuk, Aihua > > >