On Tue, May 14, 2024 at 7:58 PM Gang Wu <ust...@gmail.com> wrote:

> > We may need some guidance on just how many we need to look at;
> > we were planning on Spark and Trino, but weren't sure how much
> > further down the rabbit hole we needed to go.
>
> There are some engines living outside the Java world. It would be
> good if the proposal could cover the effort it takes to integrate
> the variant type into them (e.g. Velox, DataFusion, etc.). This is
> something that some proprietary Iceberg vendors also care about.
>
Ack, makes sense. We can make sure to share some perspective on this.

> > Not necessarily, no. As long as there's a binary type and Iceberg and
> > the query engines are aware that the binary column needs to be
> > interpreted as a variant, that should be sufficient.
>
> From the perspective of interoperability, it would be good to support
> native types from file specs. Life will be easier for projects like
> Apache XTable. File formats could also provide finer-grained statistics
> for the variant type, which facilitates data skipping.
>

Agreed, there can definitely be additional value in native file format
integration. Just wanted to highlight that it's not a strict requirement.

-Tyler

>
> Gang
>
> On Wed, May 15, 2024 at 6:49 AM Tyler Akidau
> <tyler.aki...@snowflake.com.invalid> wrote:
>
>> Good to see you again as well, JB! Thanks!
>>
>> -Tyler
>>
>>
>> On Tue, May 14, 2024 at 1:04 PM Jean-Baptiste Onofré <j...@nanthrax.net>
>> wrote:
>>
>>> Hi Tyler,
>>>
>>> Super happy to see you there :) It reminds me of our discussions back
>>> at the start of Apache Beam :)
>>>
>>> Anyway, the thread is pretty interesting. I remember some discussions
>>> about a JSON datatype for spec v3. The binary data type is already
>>> supported in spec v2.
>>>
>>> I'm looking forward to the proposal and happy to help on this!
>>>
>>> Regards
>>> JB
>>>
>>> On Sat, May 11, 2024 at 7:06 AM Tyler Akidau
>>> <tyler.aki...@snowflake.com.invalid> wrote:
>>> >
>>> > Hello,
>>> >
>>> > We (Tyler, Nileema, Selcuk, Aihua) are working on a proposal for which
>>> > we’d like to get early feedback from the community. As you may know,
>>> > Snowflake has embraced Iceberg as its open Data Lake format. Having made
>>> > good progress on our own adoption of the Iceberg standard, we’re now in a
>>> > position where there are features not yet supported in Iceberg which we
>>> > think would be valuable for our users, and that we would like to discuss
>>> > with and help contribute to the Iceberg community.
>>> >
>>> > The first two such features we’d like to discuss are in support of
>>> > efficient querying of dynamically typed, semi-structured data: variant
>>> > data types, and subcolumnarization of variant columns. In more detail,
>>> > for anyone who may not already be familiar:
>>> >
>>> > 1. Variant data types
>>> > Variant types allow for the efficient binary encoding of dynamic
>>> > semi-structured data such as JSON, Avro, etc. By encoding semi-structured
>>> > data as a variant column, we retain the flexibility of the source data,
>>> > while allowing query engines to more efficiently operate on the data.
>>> > Snowflake has supported the variant data type on Snowflake tables for many
>>> > years [1]. As more and more users utilize Iceberg tables in Snowflake,
>>> > we’re hearing an increasing chorus of requests for variant support.
>>> > Additionally, other query engines such as Apache Spark have begun adding
>>> > variant support [2]. As such, we believe it would be beneficial to the
>>> > Iceberg community as a whole to standardize on the variant data type
>>> > encoding used across Iceberg tables.
>>> >
>>> > One specific point to make here is that, since an Apache OSS version
>>> > of variant encoding already exists in Spark, it likely makes sense to
>>> > simply adopt the Spark encoding as the Iceberg standard as well. The
>>> > encoding we use internally today in Snowflake is slightly different, but
>>> > essentially equivalent, and we see no particular value in trying to clutter
>>> > the space with another equivalent-but-incompatible encoding.
>>> >
>>> >
>>> > 2. Subcolumnarization
>>> > Subcolumnarization of variant columns allows query engines to
>>> > efficiently prune datasets when subcolumns (i.e., nested fields) within a
>>> > variant column are queried, and also allows optionally materializing some
>>> > of the nested fields as a column on their own, affording queries on these
>>> > subcolumns the ability to read less data and spend less CPU on extraction.
>>> > When subcolumnarizing, the system managing table metadata and data tracks
>>> > individual pruning statistics (min, max, null, etc.) for some subset of the
>>> > nested fields within a variant, and also manages any optional
>>> > materialization. Without subcolumnarization, any query which touches a
>>> > variant column must read, parse, extract, and filter every row for which
>>> > that column is non-null. Thus, by providing a standardized way of tracking
>>> > subcolumn metadata and data for variant columns, Iceberg can make
>>> > subcolumnar optimizations accessible across various catalogs and query
>>> > engines.
>>> >
>>> > Subcolumnarization is a non-trivial topic, so we expect any concrete
>>> > proposal to include not only the set of changes to Iceberg metadata that
>>> > allow compatible query engines to interoperate on subcolumnarization data
>>> > for variant columns, but also reference documentation explaining
>>> > subcolumnarization principles and recommended best practices.
>>> >
>>> >
>>> > It sounds like the recent Geo proposal [3] may be a good starting
>>> > point for how to approach this, so our plan is to write something up in
>>> > that vein that covers the proposed spec changes, backwards compatibility,
>>> > implementor burdens, etc. But we wanted to first reach out to the community
>>> > to introduce ourselves and the idea, and see if there’s any early feedback
>>> > we should incorporate before we spend too much time on a concrete proposal.
>>> >
>>> > Thank you!
>>> >
>>> > [1] https://docs.snowflake.com/en/sql-reference/data-types-semistructured
>>> > [2] https://github.com/apache/spark/blob/master/common/variant/README.md
>>> > [3] https://docs.google.com/document/d/1iVFbrRNEzZl8tDcZC81GFt01QJkLJsI9E2NBOt21IRI/edit
>>> >
>>> > -Tyler, Nileema, Selcuk, Aihua
>>> >
>>>
>>
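
For readers new to the idea, the subcolumn pruning described in the original proposal can be sketched roughly as follows. This is an illustrative Python sketch only, not Iceberg's metadata layout or any engine's implementation. All names in it (SubcolumnStats, DataFile, collect_stats, may_match_greater_than, the file names, and the "user.age" path) are hypothetical, and JSON strings stand in for encoded variant values.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class SubcolumnStats:
    """Hypothetical per-file pruning statistics for one nested field."""
    min_value: Any = None
    max_value: Any = None
    null_count: int = 0  # rows where the field is missing or null


@dataclass
class DataFile:
    name: str
    rows: List[str]  # JSON text standing in for encoded variant values
    stats: Dict[str, SubcolumnStats] = field(default_factory=dict)


def extract(value: Any, path: str) -> Any:
    """Walk a dotted path such as "user.age"; return None if it is absent."""
    for key in path.split("."):
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value


def collect_stats(data_file: DataFile, paths: List[str]) -> None:
    """Record min, max, and null count for each tracked subcolumn.

    Assumes all non-null values under one path are mutually comparable.
    """
    for path in paths:
        stats = SubcolumnStats()
        for row in data_file.rows:
            v = extract(json.loads(row), path)
            if v is None:
                stats.null_count += 1
                continue
            if stats.min_value is None or v < stats.min_value:
                stats.min_value = v
            if stats.max_value is None or v > stats.max_value:
                stats.max_value = v
        data_file.stats[path] = stats


def may_match_greater_than(data_file: DataFile, path: str, threshold: Any) -> bool:
    """Return False only when the stats prove no row satisfies path > threshold."""
    stats = data_file.stats.get(path)
    if stats is None:
        return True   # path not tracked: the file must be scanned
    if stats.max_value is None:
        return False  # every row is null or missing for this path
    return stats.max_value > threshold


# Hypothetical data files; only b.parquet can contain user.age > 40.
files = [
    DataFile("a.parquet", ['{"user": {"age": 25}}', '{"user": {"age": 31}}']),
    DataFile("b.parquet", ['{"user": {"age": 52}}', '{"user": {}}']),
]
for f in files:
    collect_stats(f, ["user.age"])

to_scan = [f.name for f in files if may_match_greater_than(f, "user.age", 40)]
print(to_scan)
```

Without the tracked statistics, both files would have to be read and every variant value parsed. With them, a.parquet can be skipped from metadata alone, which is the benefit the proposal describes.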