Hello,

We (Tyler, Nileema, Selcuk, Aihua) are working on a proposal for which we’d
like to get early feedback from the community. As you may know, Snowflake
has embraced Iceberg as its open Data Lake format. Having made good
progress on our own adoption of the Iceberg standard, we’re now in a
position where there are features not yet supported in Iceberg which we
think would be valuable for our users, and that we would like to discuss
with and help contribute to the Iceberg community.

The first two such features we’d like to discuss are in support of
efficient querying of dynamically typed, semi-structured data: variant data
types, and subcolumnarization of variant columns. In more detail, for
anyone who may not already be familiar:

1. Variant data types
Variant types allow for the efficient binary encoding of dynamic
semi-structured data such as JSON, Avro, etc. By encoding semi-structured
data as a variant column, we retain the flexibility of the source data,
while allowing query engines to more efficiently operate on the data.
Snowflake has supported the variant data type on Snowflake tables for many
years [1]. As more and more users utilize Iceberg tables in Snowflake,
we’re hearing an increasing chorus of requests for variant support.
Additionally, other query engines such as Apache Spark have begun adding
variant support [2]. As such, we believe it would be beneficial to the
Iceberg community as a whole to standardize on the variant data type
encoding used across Iceberg tables.

One specific point to make here is that, since an Apache OSS version of
variant encoding already exists in Spark, it likely makes sense to simply
adopt the Spark encoding as the Iceberg standard as well. The encoding we
use internally today in Snowflake is slightly different, but essentially
equivalent, and we see no particular value in trying to clutter the space
with another equivalent-but-incompatible encoding.


2. Subcolumnarization
Subcolumnarization of variant columns allows query engines to efficiently
prune datasets when subcolumns (i.e., nested fields) within a variant
column are queried, and also allows optionally materializing some of the
nested fields as a column on their own, affording queries on these
subcolumns the ability to read less data and spend less CPU on extraction.
When subcolumnarizing, the system managing table metadata and data tracks
individual pruning statistics (min, max, null, etc.) for some subset of the
nested fields within a variant, and also manages any optional
materialization. Without subcolumnarization, any query which touches a
variant column must read, parse, extract, and filter every row for which
that column is non-null. Thus, by providing a standardized way of tracking
subcolum metadata and data for variant columns, Iceberg can make
subcolumnar optimizations accessible across various catalogs and query
engines.

Subcolumnarization is a non-trivial topic, so we expect any concrete
proposal to include not only the set of changes to Iceberg metadata that
allow compatible query engines to interopate on subcolumnarization data for
variant columns, but also reference documentation explaining
subcolumnarization principles and recommended best practices.


It sounds like the recent Geo proposal [3] may be a good starting point for
how to approach this, so our plan is to write something up in that vein
that covers the proposed spec changes, backwards compatibility, implementor
burdens, etc. But we wanted to first reach out to the community to
introduce ourselves and the idea, and see if there’s any early feedback we
should incorporate before we spend too much time on a concrete proposal.

Thank you!

[1] https://docs.snowflake.com/en/sql-reference/data-types-semistructured
[2] https://github.com/apache/spark/blob/master/common/variant/README.md
[3]
https://docs.google.com/document/d/1iVFbrRNEzZl8tDcZC81GFt01QJkLJsI9E2NBOt21IRI/edit

-Tyler, Nileema, Selcuk, Aihua

Reply via email to