Hi Aihua,

Long time no see :) Would this mean that every engine which plans to support the Variant data type needs to add Spark as a dependency? Like Flink/Trino/Hive, etc.?

Thanks,
Peter
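The dependency question comes down to whether the encoding is specified at the byte level: if it is, any engine can implement its own reader directly against the spec, with no Spark dependency. As a purely illustrative sketch, here is a toy dictionary-based, variant-style layout (hypothetical; the actual Spark encoding is defined in its variant README) showing that a reader is just byte-level decoding:

```python
import struct

def encode(obj):
    """Encode a flat dict of string->int into a (metadata, value) byte pair.
    Hypothetical layout: metadata holds a sorted key dictionary; the value
    stores (key_id, int64) pairs that reference keys by index."""
    keys = sorted(obj)
    metadata = struct.pack("<B", len(keys))
    for k in keys:
        kb = k.encode("utf-8")
        metadata += struct.pack("<B", len(kb)) + kb
    value = struct.pack("<B", len(obj))
    for i, k in enumerate(keys):
        value += struct.pack("<Bq", i, obj[k])
    return metadata, value

def read_keys(metadata):
    """Decode the key dictionary from the metadata buffer."""
    n, keys, pos = metadata[0], [], 1
    for _ in range(n):
        ln = metadata[pos]
        keys.append(metadata[pos + 1:pos + 1 + ln].decode("utf-8"))
        pos += 1 + ln
    return keys

def get_field(metadata, value, name):
    """Extract a single field without materializing the whole object."""
    keys = read_keys(metadata)
    for i in range(value[0]):
        key_id, v = struct.unpack_from("<Bq", value, 1 + i * 9)
        if keys[key_id] == name:
            return v
    return None

md, val = encode({"city_id": 42, "zip": 94105})
print(get_field(md, val, "zip"))  # 94105
```

The point is only that the reader needs the byte layout, not any Spark classes; a real implementation would follow the Spark variant spec's type tags and offset encoding instead of this toy layout.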
On Fri, Jul 12, 2024, 19:10 Aihua Xu <aihu...@apache.org> wrote:

> Thanks Ryan.
>
> Yeah. That's another reason we want to pursue the Spark encoding: to keep
> compatibility for the open source engines.
>
> One more question regarding the encoding implementation: is there any
> issue with directly using the Spark implementation in Iceberg? Russell
> pointed out that Trino doesn't have a Spark dependency, and that could be
> a problem?
>
> Thanks,
> Aihua
>
> On 2024/07/12 15:02:06 Ryan Blue wrote:
> > Thanks, Aihua!
> >
> > I think that the encoding choice in the current doc is a good one. I went
> > through the Spark encoding in detail and it looks like a better choice than
> > the other candidate encodings for quickly accessing nested fields.
> >
> > Another reason to use the Spark type is that this is what Delta's variant
> > type is based on, so Parquet files in tables written by Delta could be
> > converted or used in Iceberg tables without needing to rewrite variant
> > data. (Also, note that I work at Databricks and have an interest in
> > increasing format compatibility.)
> >
> > Ryan
> >
> > On Thu, Jul 11, 2024 at 11:21 AM Aihua Xu <aihua...@snowflake.com.invalid> wrote:
> >
> > > [Discuss] Consensus for Variant Encoding
> > >
> > > It was great to be able to present the Variant type proposal in the
> > > community sync yesterday, and I'm looking to host a meeting next week
> > > (targeting 9am, July 17th) to go over any further concerns about the
> > > encoding of the Variant type and any other questions on the first phase of
> > > the proposal
> > > <https://docs.google.com/document/d/1QjhpG_SVNPZh3anFcpicMQx90ebwjL7rmzFYfUP89Iw/edit>.
> > > We are hoping that anyone who is interested in the proposal can either join
> > > or reply with their comments so we can discuss them. A summary of the
> > > discussion and notes will be sent to the mailing list for further comment
> > > there.
> > >
> > >    - What should be the underlying binary representation?
> > >
> > > We have evaluated a few encodings in the doc, including ION, JSONB, and
> > > the Spark encoding. Choosing the underlying encoding is an important first
> > > step here, and we believe we have general support for Spark's Variant
> > > encoding. We would like to hear if anyone else has strong opinions in this
> > > space.
> > >
> > >    - Should we support multiple logical types or just Variant? Variant
> > >      vs. Variant + JSON.
> > >
> > > This is to discuss which logical data type(s) should be supported in
> > > Iceberg: Variant only vs. Variant + JSON. Both types would share the same
> > > underlying encoding but would imply different limitations on engines
> > > working with those types.
> > >
> > > From the sync-up meeting, we are leaning toward supporting Variant only,
> > > and we want to reach a consensus on the supported type(s).
> > >
> > >    - How should we move forward with subcolumnarization?
> > >
> > > Subcolumnarization is an optimization for the Variant type that separates
> > > out subcolumns with their own metadata. This is not critical for choosing
> > > the initial encoding of the Variant type, so we were hoping to gain
> > > consensus on leaving that for a follow-up spec.
> > >
> > > Thanks,
> > > Aihua
> > >
> > > Meeting invite:
> > >
> > > Wednesday, July 17 · 9:00 – 10:00am
> > > Time zone: America/Los_Angeles
> > > Google Meet joining info
> > > Video call link: https://meet.google.com/pbm-ovzn-aoq
> > > Or dial: (US) +1 650-449-9343 PIN: 170 576 525#
> > > More phone numbers: https://tel.meet/pbm-ovzn-aoq?pin=4079632691790
> > >
> > > On Tue, May 28, 2024 at 9:21 PM Aihua Xu <aihua...@snowflake.com> wrote:
> > >
> > >> Hello,
> > >>
> > >> We have drafted the proposal
> > >> <https://docs.google.com/document/d/1QjhpG_SVNPZh3anFcpicMQx90ebwjL7rmzFYfUP89Iw/edit>
> > >> for the Variant data type.
> > >> Please help review and comment.
> > >>
> > >> Thanks,
> > >> Aihua
> > >>
> > >> On Thu, May 16, 2024 at 12:45 PM Jack Ye <yezhao...@gmail.com> wrote:
> > >>
> > >>> +10000 for a JSON/BSON type. We also had the same discussion internally,
> > >>> and a JSON type would really play well with, for example, the SUPER type
> > >>> in Redshift:
> > >>> https://docs.aws.amazon.com/redshift/latest/dg/r_SUPER_type.html, and
> > >>> can also provide better integration with the Trino JSON type.
> > >>>
> > >>> Looking forward to the proposal!
> > >>>
> > >>> Best,
> > >>> Jack Ye
> > >>>
> > >>> On Wed, May 15, 2024 at 9:37 AM Tyler Akidau
> > >>> <tyler.aki...@snowflake.com.invalid> wrote:
> > >>>
> > >>>> On Tue, May 14, 2024 at 7:58 PM Gang Wu <ust...@gmail.com> wrote:
> > >>>>
> > >>>>> > We may need some guidance on just how many we need to look at;
> > >>>>> > we were planning on Spark and Trino, but weren't sure how much
> > >>>>> > further down the rabbit hole we needed to go.
> > >>>>>
> > >>>>> There are some engines living outside the Java world. It would be
> > >>>>> good if the proposal could cover the effort it takes to integrate the
> > >>>>> variant type into them (e.g. Velox, DataFusion, etc.). This is
> > >>>>> something that some proprietary Iceberg vendors also care about.
> > >>>>>
> > >>>>
> > >>>> Ack, makes sense. We can make sure to share some perspective on this.
> > >>>>
> > >>>>> > Not necessarily, no. As long as there's a binary type and Iceberg and
> > >>>>> > the query engines are aware that the binary column needs to be
> > >>>>> > interpreted as a variant, that should be sufficient.
> > >>>>>
> > >>>>> From the perspective of interoperability, it would be good to support
> > >>>>> a native type in the file specs. Life would be easier for projects
> > >>>>> like Apache XTable. File formats could also provide finer-grained
> > >>>>> statistics for the variant type, which facilitates data skipping.
> > >>>>>
> > >>>>
> > >>>> Agreed, there can definitely be additional value in native file format
> > >>>> integration. Just wanted to highlight that it's not a strict requirement.
> > >>>>
> > >>>> -Tyler
> > >>>>
> > >>>>>
> > >>>>> Gang
> > >>>>>
> > >>>>> On Wed, May 15, 2024 at 6:49 AM Tyler Akidau
> > >>>>> <tyler.aki...@snowflake.com.invalid> wrote:
> > >>>>>
> > >>>>>> Good to see you again as well, JB! Thanks!
> > >>>>>>
> > >>>>>> -Tyler
> > >>>>>>
> > >>>>>> On Tue, May 14, 2024 at 1:04 PM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
> > >>>>>>
> > >>>>>>> Hi Tyler,
> > >>>>>>>
> > >>>>>>> Super happy to see you there :) It reminds me of our discussions back
> > >>>>>>> at the start of Apache Beam :)
> > >>>>>>>
> > >>>>>>> Anyway, the thread is pretty interesting. I remember some discussions
> > >>>>>>> about a JSON datatype for spec v3. The binary data type is already
> > >>>>>>> supported in spec v2.
> > >>>>>>>
> > >>>>>>> I'm looking forward to the proposal and happy to help on this!
> > >>>>>>>
> > >>>>>>> Regards,
> > >>>>>>> JB
> > >>>>>>>
> > >>>>>>> On Sat, May 11, 2024 at 7:06 AM Tyler Akidau
> > >>>>>>> <tyler.aki...@snowflake.com.invalid> wrote:
> > >>>>>>> >
> > >>>>>>> > Hello,
> > >>>>>>> >
> > >>>>>>> > We (Tyler, Nileema, Selcuk, Aihua) are working on a proposal for
> > >>>>>>> which we'd like to get early feedback from the community. As you may
> > >>>>>>> know, Snowflake has embraced Iceberg as its open data lake format.
> > >>>>>>> Having made good progress on our own adoption of the Iceberg standard,
> > >>>>>>> we're now in a position where there are features not yet supported in
> > >>>>>>> Iceberg which we think would be valuable for our users, and that we
> > >>>>>>> would like to discuss with and help contribute to the Iceberg community.
> > >>>>>>> >
> > >>>>>>> > The first two such features we'd like to discuss are in support of
> > >>>>>>> efficient querying of dynamically typed, semi-structured data: variant
> > >>>>>>> data types, and subcolumnarization of variant columns. In more detail,
> > >>>>>>> for anyone who may not already be familiar:
> > >>>>>>> >
> > >>>>>>> > 1. Variant data types
> > >>>>>>> > Variant types allow for the efficient binary encoding of dynamic,
> > >>>>>>> semi-structured data such as JSON, Avro, etc. By encoding
> > >>>>>>> semi-structured data as a variant column, we retain the flexibility of
> > >>>>>>> the source data while allowing query engines to operate on the data
> > >>>>>>> more efficiently. Snowflake has supported the variant data type on
> > >>>>>>> Snowflake tables for many years [1]. As more and more users utilize
> > >>>>>>> Iceberg tables in Snowflake, we're hearing an increasing chorus of
> > >>>>>>> requests for variant support. Additionally, other query engines such
> > >>>>>>> as Apache Spark have begun adding variant support [2]. As such, we
> > >>>>>>> believe it would be beneficial to the Iceberg community as a whole to
> > >>>>>>> standardize on the variant data type encoding used across Iceberg
> > >>>>>>> tables.
> > >>>>>>> >
> > >>>>>>> > One specific point to make here is that, since an Apache OSS
> > >>>>>>> version of variant encoding already exists in Spark, it likely makes
> > >>>>>>> sense to simply adopt the Spark encoding as the Iceberg standard as
> > >>>>>>> well. The encoding we use internally today in Snowflake is slightly
> > >>>>>>> different, but essentially equivalent, and we see no particular value
> > >>>>>>> in trying to clutter the space with another
> > >>>>>>> equivalent-but-incompatible encoding.
> > >>>>>>> >
> > >>>>>>> > 2. Subcolumnarization
> > >>>>>>> > Subcolumnarization of variant columns allows query engines to
> > >>>>>>> efficiently prune datasets when subcolumns (i.e., nested fields)
> > >>>>>>> within a variant column are queried, and also allows optionally
> > >>>>>>> materializing some of the nested fields as columns of their own,
> > >>>>>>> affording queries on these subcolumns the ability to read less data
> > >>>>>>> and spend less CPU on extraction. When subcolumnarizing, the system
> > >>>>>>> managing table metadata and data tracks individual pruning statistics
> > >>>>>>> (min, max, null, etc.) for some subset of the nested fields within a
> > >>>>>>> variant, and also manages any optional materialization. Without
> > >>>>>>> subcolumnarization, any query which touches a variant column must
> > >>>>>>> read, parse, extract, and filter every row for which that column is
> > >>>>>>> non-null. Thus, by providing a standardized way of tracking subcolumn
> > >>>>>>> metadata and data for variant columns, Iceberg can make subcolumnar
> > >>>>>>> optimizations accessible across various catalogs and query engines.
> > >>>>>>> >
> > >>>>>>> > Subcolumnarization is a non-trivial topic, so we expect any
> > >>>>>>> concrete proposal to include not only the set of changes to Iceberg
> > >>>>>>> metadata that allow compatible query engines to interoperate on
> > >>>>>>> subcolumnarization data for variant columns, but also reference
> > >>>>>>> documentation explaining subcolumnarization principles and recommended
> > >>>>>>> best practices.
> > >>>>>>> >
> > >>>>>>> > It sounds like the recent Geo proposal [3] may be a good starting
> > >>>>>>> point for how to approach this, so our plan is to write something up
> > >>>>>>> in that vein that covers the proposed spec changes, backwards
> > >>>>>>> compatibility, implementor burdens, etc.
> > >>>>>>> But we wanted to first reach out to the community to introduce
> > >>>>>>> ourselves and the idea, and see if there's any early feedback we
> > >>>>>>> should incorporate before we spend too much time on a concrete
> > >>>>>>> proposal.
> > >>>>>>> >
> > >>>>>>> > Thank you!
> > >>>>>>> >
> > >>>>>>> > [1] https://docs.snowflake.com/en/sql-reference/data-types-semistructured
> > >>>>>>> > [2] https://github.com/apache/spark/blob/master/common/variant/README.md
> > >>>>>>> > [3] https://docs.google.com/document/d/1iVFbrRNEzZl8tDcZC81GFt01QJkLJsI9E2NBOt21IRI/edit
> > >>>>>>> >
> > >>>>>>> > -Tyler, Nileema, Selcuk, Aihua

--
Ryan Blue
Databricks
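For readers less familiar with the data-skipping benefit of subcolumnarization discussed in this thread, the pruning idea can be sketched roughly as follows. The stats layout, file names, and paths here are hypothetical, not the proposed spec: the table metadata tracks min/max statistics per file for a chosen subset of nested paths inside a variant column, and a scan planner drops files whose value range cannot match the predicate.

```python
# Hypothetical per-file min/max stats for subcolumnarized variant paths.
file_stats = [
    {"file": "a.parquet", "stats": {"event.ts": (100, 199), "event.level": (1, 3)}},
    {"file": "b.parquet", "stats": {"event.ts": (200, 299), "event.level": (2, 5)}},
    {"file": "c.parquet", "stats": {}},  # path not tracked: cannot prune this file
]

def files_to_scan(path, lo, hi):
    """Keep files whose [min, max] range for `path` may overlap [lo, hi].
    Files with no stats for the path must be scanned (no pruning possible)."""
    keep = []
    for f in file_stats:
        rng = f["stats"].get(path)
        if rng is None or (rng[0] <= hi and lo <= rng[1]):
            keep.append(f["file"])
    return keep

print(files_to_scan("event.ts", 250, 400))  # ['b.parquet', 'c.parquet']
```

Without such per-path stats, a predicate on `event.ts` would require reading and parsing the full variant column in every file, which is the "read, parse, extract, and filter every row" cost described above.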