Re: [Early Feedback] Variant and Subcolumnarization Support

Tyler Akidau Tue, 14 May 2024 15:50:31 -0700

Good to see you again as well, JB! Thanks!

-Tyler



On Tue, May 14, 2024 at 1:04 PM Jean-Baptiste Onofré <j...@nanthrax.net>
wrote:

> Hi Tyler,
>
> Super happy to see you there :) It reminds me of our discussions back at
> the start of Apache Beam :)
>
> Anyway, the thread is pretty interesting. I remember some discussions
> about JSON datatype for spec v3. The binary data type is already
> supported in the spec v2.
>
> I'm looking forward to the proposal and happy to help with this!
>
> Regards
> JB
>
> On Sat, May 11, 2024 at 7:06 AM Tyler Akidau
> <tyler.aki...@snowflake.com.invalid> wrote:
> >
> > Hello,
> >
> > We (Tyler, Nileema, Selcuk, Aihua) are working on a proposal for which
> we’d like to get early feedback from the community. As you may know,
> Snowflake has embraced Iceberg as its open Data Lake format. Having made
> good progress on our own adoption of the Iceberg standard, we’re now in a
> position where there are features not yet supported in Iceberg which we
> think would be valuable for our users, and that we would like to discuss
> with and help contribute to the Iceberg community.
> >
> > The first two such features we’d like to discuss are in support of
> efficient querying of dynamically typed, semi-structured data: variant data
> types, and subcolumnarization of variant columns. In more detail, for
> anyone who may not already be familiar:
> >
> > 1. Variant data types
> > Variant types allow for the efficient binary encoding of dynamic
> semi-structured data such as JSON, Avro, etc. By encoding semi-structured
> data as a variant column, we retain the flexibility of the source data,
> while allowing query engines to more efficiently operate on the data.
> Snowflake has supported the variant data type on Snowflake tables for many
> years [1]. As more and more users utilize Iceberg tables in Snowflake,
> we’re hearing an increasing chorus of requests for variant support.
> Additionally, other query engines such as Apache Spark have begun adding
> variant support [2]. As such, we believe it would be beneficial to the
> Iceberg community as a whole to standardize on the variant data type
> encoding used across Iceberg tables.
> >
> > One specific point to make here is that, since an Apache OSS version of
> variant encoding already exists in Spark, it likely makes sense to simply
> adopt the Spark encoding as the Iceberg standard as well. The encoding we
> use internally today in Snowflake is slightly different, but essentially
> equivalent, and we see no particular value in trying to clutter the space
> with another equivalent-but-incompatible encoding.
> >
> >
> > 2. Subcolumnarization
> > Subcolumnarization of variant columns allows query engines to
> efficiently prune datasets when subcolumns (i.e., nested fields) within a
> variant column are queried, and also allows optionally materializing some
> of the nested fields as a column on their own, affording queries on these
> subcolumns the ability to read less data and spend less CPU on extraction.
> When subcolumnarizing, the system managing table metadata and data tracks
> individual pruning statistics (min, max, null, etc.) for some subset of the
> nested fields within a variant, and also manages any optional
> materialization. Without subcolumnarization, any query which touches a
> variant column must read, parse, extract, and filter every row for which
> that column is non-null. Thus, by providing a standardized way of tracking
> subcolumn metadata and data for variant columns, Iceberg can make
> subcolumnar optimizations accessible across various catalogs and query
> engines.
> >
> > Subcolumnarization is a non-trivial topic, so we expect any concrete
> proposal to include not only the set of changes to Iceberg metadata that
> allow compatible query engines to interoperate on subcolumnarization data for
> variant columns, but also reference documentation explaining
> subcolumnarization principles and recommended best practices.
> >
> >
> > It sounds like the recent Geo proposal [3] may be a good starting point
> for how to approach this, so our plan is to write something up in that vein
> that covers the proposed spec changes, backwards compatibility, implementor
> burdens, etc. But we wanted to first reach out to the community to
> introduce ourselves and the idea, and see if there’s any early feedback we
> should incorporate before we spend too much time on a concrete proposal.
> >
> > Thank you!
> >
> > [1]
> https://docs.snowflake.com/en/sql-reference/data-types-semistructured
> > [2] https://github.com/apache/spark/blob/master/common/variant/README.md
> > [3]
> https://docs.google.com/document/d/1iVFbrRNEzZl8tDcZC81GFt01QJkLJsI9E2NBOt21IRI/edit
> >
> > -Tyler, Nileema, Selcuk, Aihua
> >
>
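
To make the pruning idea in the quoted subcolumnarization paragraph concrete, here is a minimal sketch of how per-file statistics on a nested variant field could let an engine skip files for an equality filter. It is illustrative only, not the proposed Iceberg metadata layout or any engine's implementation: the names SubcolumnStats, DataFile, may_contain_equal, and prune, and the "user.age" field path, are all hypothetical, and it assumes min/max are populated whenever at least one row has a non-null value.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class SubcolumnStats:
    """Hypothetical per-file stats for one nested field of a variant column."""
    min_value: Any
    max_value: Any
    null_count: int


@dataclass
class DataFile:
    """Hypothetical data file entry; `stats` maps a nested field path to its stats."""
    path: str
    row_count: int
    stats: Dict[str, SubcolumnStats]


def may_contain_equal(data_file: DataFile, field_path: str, value: Any) -> bool:
    """Return False only when the stats prove no row can match field_path == value."""
    s = data_file.stats.get(field_path)
    if s is None:
        # No stats tracked for this nested field, so the file must be read.
        return True
    if s.null_count == data_file.row_count:
        # The field is null or missing in every row, so equality never matches.
        return False
    # Otherwise the value can only be present if it lies within [min, max].
    return s.min_value <= value <= s.max_value


def prune(files: List[DataFile], field_path: str, value: Any) -> List[DataFile]:
    """Keep only the files that might contain a matching row."""
    return [f for f in files if may_contain_equal(f, field_path, value)]
```

Files without statistics for the queried field are always kept, which matches the point in the quoted text that only some subset of nested fields is tracked: a missing entry means "unknown", not "absent".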
