Sounds like a good idea. Looking forward to a proposal.

Yufei
On Sat, May 11, 2024 at 9:27 AM Amogh Jahagirdar <am...@tabular.io> wrote:

> Hi all,
>
> Thanks for raising this thread! Overall I agree that some form of variant type would be useful in Iceberg, and we've seen interest in support for that as well. Specifically, there has been discussion on adding an optimized JSON data type as part of the V3 spec <https://github.com/apache/iceberg/issues/9066>. I have some questions on some of the points that were brought up:
>
> > One specific point to make here is that, since an Apache OSS version of variant encoding already exists in Spark, it likely makes sense to simply adopt the Spark encoding as the Iceberg standard as well. The encoding we use internally today in Snowflake is slightly different, but essentially equivalent, and we see no particular value in trying to clutter the space with another equivalent-but-incompatible encoding.
>
> 1.) Isn't the variant encoding in Spark essentially a binary data type (i.e., an array of arbitrary bytes)? Iceberg already has a binary data type <https://iceberg.apache.org/spec/#primitive-types> defined as well. I would think we would want to spec out something along the lines of a JSONB data type (how keys are organized in the structure, how null values are handled, etc.). It's worth looking at what systems like Postgres do. In general, I think we would want to choose an encoding scheme that balances good compression characteristics against decoding/reading performance. We also want to make engine integration as easy as possible (so that the common JSON functions engines expose can be implemented efficiently on top of it).
>
> 2.) I'm less familiar with subcolumnarization, but it sounds like a technique for pruning and efficient materialization of nested fields, which is very interesting. That said, I'd probably try to get consensus on the specific data types we want to add and spec those out first, along with integration considerations (Spark, Trino, Python, etc.). I think there's enough to unpack there that it's worth a discussion before jumping into more complex optimizations. Certain cases may be inefficient without this technique, but I think it's quite useful to at least start with the data type definitions and engine integrations; we could then look at support for this technique afterwards. I wonder what others think, though.
>
> > so our plan is to write something up in that vein that covers the proposed spec changes, backwards compatibility, implementor burdens, etc.
>
> 3.) Minor comment, since I'm assuming this falls under "implementor burdens": I think a proposal should also ideally cover any relevant details for integrations with Spark, Trino, PyIceberg, etc., at least to make sure the community is aware of any challenges in integration (if any), so we can make progress in those areas and get adoption for these new data types!
>
> Overall this is very exciting!
>
> Thanks,
>
> Amogh Jahagirdar
>
> On Fri, May 10, 2024 at 11:28 PM Gang Wu <ust...@gmail.com> wrote:
>
>> Hi,
>>
>> This sounds very interesting!
>>
>> IIUC, the current variant type in Apache Spark stores data in the BINARY type. When it comes to subcolumnarization, does it require the file format (e.g., Apache Parquet/ORC/Avro) to support the variant type natively?
>>
>> Best,
>> Gang
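[As context for the encoding discussion above, a minimal PySpark sketch of how the Spark variant type surfaces to users, assuming a Spark build with variant support (e.g., Spark 4.x); the input data and column names are illustrative. Per the Spark variant README referenced below as [2], each value is physically a pair of binary buffers (metadata plus value), which is why it maps naturally onto a binary-like storage type.]

    # A minimal sketch, assuming a Spark build with variant support
    # (e.g., Spark 4.x); table contents and column names are illustrative.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, parse_json, variant_get

    spark = SparkSession.builder.appName("variant-sketch").getOrCreate()

    df = spark.createDataFrame(
        [('{"user": {"id": 42, "tags": ["a", "b"]}}',)], ["raw_json"]
    )

    # parse_json yields a VARIANT column: a compact binary
    # (metadata + value) pair rather than a raw JSON string, so engines
    # need not re-parse text on every query.
    variants = df.select(parse_json(col("raw_json")).alias("v"))

    # variant_get extracts a typed nested field by path. Without
    # subcolumnarization, this still has to decode the binary value of
    # every non-null row in the scanned files.
    variants.select(
        variant_get(col("v"), "$.user.id", "int").alias("user_id")
    ).show()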
>> On Sat, May 11, 2024 at 1:07 PM Tyler Akidau <tyler.aki...@snowflake.com.invalid> wrote:
>>
>>> Hello,
>>>
>>> We (Tyler, Nileema, Selcuk, Aihua) are working on a proposal for which we'd like to get early feedback from the community. As you may know, Snowflake has embraced Iceberg as its open Data Lake format. Having made good progress on our own adoption of the Iceberg standard, we're now in a position where there are features not yet supported in Iceberg that we think would be valuable for our users, and that we would like to discuss with and help contribute to the Iceberg community.
>>>
>>> The first two such features we'd like to discuss are in support of efficient querying of dynamically typed, semi-structured data: variant data types, and subcolumnarization of variant columns. In more detail, for anyone who may not already be familiar:
>>>
>>> 1. Variant data types
>>> Variant types allow for the efficient binary encoding of dynamic semi-structured data such as JSON, Avro, etc. By encoding semi-structured data as a variant column, we retain the flexibility of the source data while allowing query engines to operate on it more efficiently. Snowflake has supported the variant data type on Snowflake tables for many years [1]. As more and more users utilize Iceberg tables in Snowflake, we're hearing an increasing chorus of requests for variant support. Additionally, other query engines such as Apache Spark have begun adding variant support [2]. As such, we believe it would be beneficial to the Iceberg community as a whole to standardize on the variant data type encoding used across Iceberg tables.
>>>
>>> One specific point to make here is that, since an Apache OSS version of variant encoding already exists in Spark, it likely makes sense to simply adopt the Spark encoding as the Iceberg standard as well. The encoding we use internally today in Snowflake is slightly different, but essentially equivalent, and we see no particular value in trying to clutter the space with another equivalent-but-incompatible encoding.
>>>
>>> 2. Subcolumnarization
>>> Subcolumnarization of variant columns allows query engines to efficiently prune datasets when subcolumns (i.e., nested fields) within a variant column are queried, and also allows optionally materializing some of the nested fields as columns of their own, affording queries on these subcolumns the ability to read less data and spend less CPU on extraction. When subcolumnarizing, the system managing table metadata and data tracks individual pruning statistics (min, max, null, etc.) for some subset of the nested fields within a variant, and also manages any optional materialization. Without subcolumnarization, any query which touches a variant column must read, parse, extract, and filter every row for which that column is non-null. Thus, by providing a standardized way of tracking subcolumn metadata and data for variant columns, Iceberg can make subcolumnar optimizations accessible across various catalogs and query engines.
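[To make the pruning mechanics above concrete, a purely illustrative sketch of the kind of per-file statistics a subcolumnarizing writer might track for a nested path, and how a planner could use them to skip files. None of these structures exist in the Iceberg spec today; the actual shape would be defined by the proposal.]

    # Hypothetical illustration of subcolumn pruning statistics; nothing
    # here is part of the Iceberg spec. The shapes are invented for
    # explanatory purposes only.
    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class SubcolumnStats:
        path: str              # nested path inside the variant, e.g. "$.user.id"
        lower: Optional[int]   # min value observed in the data file, if tracked
        upper: Optional[int]   # max value observed in the data file, if tracked
        null_count: int

    def can_skip_file(stats: SubcolumnStats, predicate_value: int) -> bool:
        """True if a filter like `path = predicate_value` provably matches
        no rows in the file, so the file can be pruned without reading,
        parsing, or extracting any variant bytes."""
        if stats.lower is None or stats.upper is None:
            return False  # no stats tracked for this path: must scan the file
        return predicate_value < stats.lower or predicate_value > stats.upper

    # A file whose $.user.id values span [100, 500] is skipped entirely
    # for the predicate $.user.id = 42.
    file_stats = SubcolumnStats("$.user.id", lower=100, upper=500, null_count=0)
    assert can_skip_file(file_stats, 42)

[Optional materialization, as described above, would go a step further: the extracted values for a nested field are stored as a column of their own, so queries touching only that field can avoid variant decoding entirely.]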
>>> Subcolumnarization is a non-trivial topic, so we expect any concrete proposal to include not only the set of changes to Iceberg metadata that allow compatible query engines to interoperate on subcolumnarization data for variant columns, but also reference documentation explaining subcolumnarization principles and recommended best practices.
>>>
>>> It sounds like the recent Geo proposal [3] may be a good starting point for how to approach this, so our plan is to write something up in that vein that covers the proposed spec changes, backwards compatibility, implementor burdens, etc. But we wanted to first reach out to the community to introduce ourselves and the idea, and see if there's any early feedback we should incorporate before we spend too much time on a concrete proposal.
>>>
>>> Thank you!
>>>
>>> [1] https://docs.snowflake.com/en/sql-reference/data-types-semistructured
>>> [2] https://github.com/apache/spark/blob/master/common/variant/README.md
>>> [3] https://docs.google.com/document/d/1iVFbrRNEzZl8tDcZC81GFt01QJkLJsI9E2NBOt21IRI/edit
>>>
>>> -Tyler, Nileema, Selcuk, Aihua