I raised the same point from Peter's email in a comment on the doc as well. There is a spark-variant_2.13 artifact that would be a much smaller scope than relying on large portions of Spark, but even then I doubt that it is a good idea for Iceberg to depend on that, because it is a Scala artifact and we would need to bring in a ton of Scala libs. I think what makes the most sense is to have an independent implementation of the spec in Iceberg.
On Fri, Jul 12, 2024 at 11:51 AM Péter Váry <peter.vary.apa...@gmail.com> wrote: > Hi Aihua, > Long time no see :) > Would this mean that every engine which plans to support the Variant data > type needs to add Spark as a dependency? Like Flink/Trino/Hive etc? > Thanks, Peter > > > On Fri, Jul 12, 2024, 19:10 Aihua Xu <aihu...@apache.org> wrote: > >> Thanks Ryan. >> >> Yeah. That's another reason we want to pursue Spark encoding to keep >> compatibility with the open source engines. >> >> One more question regarding the encoding implementation: is there an >> issue with directly using the Spark implementation in Iceberg? Russell pointed out >> that Trino doesn't have a Spark dependency and that could be a problem. >> >> Thanks, >> Aihua >> >> On 2024/07/12 15:02:06 Ryan Blue wrote: >> > Thanks, Aihua! >> > >> > I think that the encoding choice in the current doc is a good one. I >> went >> > through the Spark encoding in detail and it looks like a better choice >> than >> > the other candidate encodings for quickly accessing nested fields. >> > >> > Another reason to use the Spark type is that this is what Delta's >> variant >> > type is based on, so Parquet files in tables written by Delta could be >> > converted or used in Iceberg tables without needing to rewrite variant >> > data. (Also, note that I work at Databricks and have an interest in >> > increasing format compatibility.)
>> > >> > Ryan >> > >> > On Thu, Jul 11, 2024 at 11:21 AM Aihua Xu <aihua...@snowflake.com >> .invalid> >> > wrote: >> > >> > > [Discuss] Consensus for Variant Encoding >> > > >> > > It’s great to be able to present the Variant type proposal in the >> > > community sync yesterday and I’m looking to host a meeting next week >> > > (targeting 9am, July 17th) to go over any further concerns about >> the >> > > encoding of the Variant type and any other questions on the first >> phase of >> > > the proposal >> > > < >> https://docs.google.com/document/d/1QjhpG_SVNPZh3anFcpicMQx90ebwjL7rmzFYfUP89Iw/edit >> >. >> > > We are hoping that anyone who is interested in the proposal can >> either join >> > > or reply with their comments so we can discuss them. A summary of the >> > > discussion and notes will be sent to the mailing list for further >> comment >> > > there. >> > > >> > > >> > > - >> > > >> > > What should be the underlying binary representation? >> > > >> > > We have evaluated a few encodings in the doc, including ION, JSONB, and >> > > the Spark encoding. Choosing the underlying encoding is an important first >> step >> > > here and we believe we have general support for Spark’s Variant >> encoding. >> > > We would like to hear if anyone else has strong opinions in this >> space. >> > > >> > > >> > > - >> > > >> > > Should we support multiple logical types or just Variant? Variant >> vs. >> > > Variant + JSON. >> > > >> > > This is to discuss what logical data type(s) should be supported in >> Iceberg - >> > > Variant only vs. Variant + JSON. Both types would share the same >> underlying >> > > encoding but would imply different limitations on engines working with >> > > those types. >> > > >> > > From the sync-up meeting, we are leaning toward supporting >> Variant >> > > only and we want to have a consensus on the supported type(s). >> > > >> > > >> > > - >> > > >> > > How should we move forward with Subcolumnarization?
>> > > >> > > Subcolumnarization is an optimization for the Variant type that separates out >> > > subcolumns with their own metadata. This is not critical for choosing >> the >> > > initial encoding of the Variant type, so we were hoping to gain >> consensus on >> > > leaving that for a follow-up spec. >> > > >> > > >> > > Thanks >> > > >> > > Aihua >> > > >> > > Meeting invite: >> > > >> > > Wednesday, July 17 · 9:00 – 10:00am >> > > Time zone: America/Los_Angeles >> > > Google Meet joining info >> > > Video call link: https://meet.google.com/pbm-ovzn-aoq >> > > Or dial: (US) +1 650-449-9343 PIN: 170 576 525# >> > > More phone numbers: https://tel.meet/pbm-ovzn-aoq?pin=4079632691790 >> > > >> > > On Tue, May 28, 2024 at 9:21 PM Aihua Xu <aihua...@snowflake.com> >> wrote: >> > > >> > >> Hello, >> > >> >> > >> We have drafted the proposal >> > >> < >> https://docs.google.com/document/d/1QjhpG_SVNPZh3anFcpicMQx90ebwjL7rmzFYfUP89Iw/edit >> > >> > >> for the Variant data type. Please help review and comment. >> > >> >> > >> Thanks, >> > >> Aihua >> > >> >> > >> On Thu, May 16, 2024 at 12:45 PM Jack Ye <yezhao...@gmail.com> >> wrote: >> > >> >> > >>> +10000 for a JSON/BSON type. We also had the same discussion >> internally >> > >>> and a JSON type would really play well with, for example, the SUPER >> type in >> > >>> Redshift: >> > >>> https://docs.aws.amazon.com/redshift/latest/dg/r_SUPER_type.html, >> and >> > >>> can also provide better integration with the Trino JSON type. >> > >>> >> > >>> Looking forward to the proposal!
>> > >>> >> > >>> Best, >> > >>> Jack Ye >> > >>> >> > >>> >> > >>> On Wed, May 15, 2024 at 9:37 AM Tyler Akidau >> > >>> <tyler.aki...@snowflake.com.invalid> wrote: >> > >>> >> > >>>> On Tue, May 14, 2024 at 7:58 PM Gang Wu <ust...@gmail.com> wrote: >> > >>>> >> > >>>>> > We may need some guidance on just how many we need to look at; >> > >>>>> > we were planning on Spark and Trino, but weren't sure how much >> > >>>>> > further down the rabbit hole we needed to go. >> > >>>>> >> > >>>>> There are some engines living outside the Java world. It would be >> > >>>>> good if the proposal could cover the effort it takes to integrate the >> > >>>>> variant type into them (e.g. velox, datafusion, etc.). This is >> something >> > >>>>> that >> > >>>>> some proprietary Iceberg vendors also care about. >> > >>>>> >> > >>>> >> > >>>> Ack, makes sense. We can make sure to share some perspective on >> this. >> > >>>> >> > >>>> > Not necessarily, no. As long as there's a binary type and >> Iceberg and >> > >>>>> > the query engines are aware that the binary column needs to be >> > >>>>> > interpreted as a variant, that should be sufficient. >> > >>>>> >> > >>>>> From the perspective of interoperability, it would be good to >> support a >> > >>>>> native >> > >>>>> type in the file specs. Life will be easier for projects like Apache >> > >>>>> XTable. >> > >>>>> File formats could also provide finer-grained statistics for the >> variant >> > >>>>> type, which >> > >>>>> facilitates data skipping. >> > >>>>> >> > >>>> >> > >>>> Agreed, there can definitely be additional value in native file >> format >> > >>>> integration. Just wanted to highlight that it's not a strict >> requirement. >> > >>>> >> > >>>> -Tyler >> > >>>> >> > >>>> >> > >>>>> >> > >>>>> Gang >> > >>>>> >> > >>>>> On Wed, May 15, 2024 at 6:49 AM Tyler Akidau >> > >>>>> <tyler.aki...@snowflake.com.invalid> wrote: >> > >>>>> >> > >>>>>> Good to see you again as well, JB! Thanks!
>> > >>>>>> >> > >>>>>> -Tyler >> > >>>>>> >> > >>>>>> >> > >>>>>> On Tue, May 14, 2024 at 1:04 PM Jean-Baptiste Onofré < >> j...@nanthrax.net> >> > >>>>>> wrote: >> > >>>>>> >> > >>>>>>> Hi Tyler, >> > >>>>>>> >> > >>>>>>> Super happy to see you there :) It reminds me of our discussions >> back at >> > >>>>>>> the start of Apache Beam :) >> > >>>>>>> >> > >>>>>>> Anyway, the thread is pretty interesting. I remember some >> discussions >> > >>>>>>> about a JSON datatype for spec v3. The binary data type is already >> > >>>>>>> supported in spec v2. >> > >>>>>>> >> > >>>>>>> I'm looking forward to the proposal and happy to help on this! >> > >>>>>>> >> > >>>>>>> Regards >> > >>>>>>> JB >> > >>>>>>> >> > >>>>>>> On Sat, May 11, 2024 at 7:06 AM Tyler Akidau >> > >>>>>>> <tyler.aki...@snowflake.com.invalid> wrote: >> > >>>>>>> > >> > >>>>>>> > Hello, >> > >>>>>>> > >> > >>>>>>> > We (Tyler, Nileema, Selcuk, Aihua) are working on a proposal >> for >> > >>>>>>> which we’d like to get early feedback from the community. As >> you may know, >> > >>>>>>> Snowflake has embraced Iceberg as its open Data Lake format. >> Having made >> > >>>>>>> good progress on our own adoption of the Iceberg standard, >> we’re now in a >> > >>>>>>> position where there are features not yet supported in Iceberg >> which we >> > >>>>>>> think would be valuable for our users, and that we would like >> to discuss >> > >>>>>>> with and help contribute to the Iceberg community. >> > >>>>>>> > >> > >>>>>>> > The first two such features we’d like to discuss are in >> support of >> > >>>>>>> efficient querying of dynamically typed, semi-structured data: >> variant data >> > >>>>>>> types, and subcolumnarization of variant columns. In more >> detail, for >> > >>>>>>> anyone who may not already be familiar: >> > >>>>>>> > >> > >>>>>>> > 1. Variant data types >> > >>>>>>> > Variant types allow for the efficient binary encoding of >> dynamic >> > >>>>>>> semi-structured data such as JSON, Avro, etc.
By encoding >> semi-structured >> > >>>>>>> data as a variant column, we retain the flexibility of the >> source data, >> > >>>>>>> while allowing query engines to more efficiently operate on the >> data. >> > >>>>>>> Snowflake has supported the variant data type on Snowflake >> tables for many >> > >>>>>>> years [1]. As more and more users utilize Iceberg tables in >> Snowflake, >> > >>>>>>> we’re hearing an increasing chorus of requests for variant >> support. >> > >>>>>>> Additionally, other query engines such as Apache Spark have >> begun adding >> > >>>>>>> variant support [2]. As such, we believe it would be beneficial >> to the >> > >>>>>>> Iceberg community as a whole to standardize on the variant data >> type >> > >>>>>>> encoding used across Iceberg tables. >> > >>>>>>> > >> > >>>>>>> > One specific point to make here is that, since an Apache OSS >> > >>>>>>> version of variant encoding already exists in Spark, it likely >> makes sense >> > >>>>>>> to simply adopt the Spark encoding as the Iceberg standard as >> well. The >> > >>>>>>> encoding we use internally today in Snowflake is slightly >> different, but >> > >>>>>>> essentially equivalent, and we see no particular value in >> trying to clutter >> > >>>>>>> the space with another equivalent-but-incompatible encoding. >> > >>>>>>> > >> > >>>>>>> > >> > >>>>>>> > 2. Subcolumnarization >> > >>>>>>> > Subcolumnarization of variant columns allows query engines to >> > >>>>>>> efficiently prune datasets when subcolumns (i.e., nested >> fields) within a >> > >>>>>>> variant column are queried, and also allows optionally >> materializing some >> > >>>>>>> of the nested fields as columns of their own, affording >> queries on these >> > >>>>>>> subcolumns the ability to read less data and spend less CPU on >> extraction. >> > >>>>>>> When subcolumnarizing, the system managing table metadata and >> data tracks >> > >>>>>>> individual pruning statistics (min, max, null, etc.)
for some >> subset of the >> > >>>>>>> nested fields within a variant, and also manages any optional >> > >>>>>>> materialization. Without subcolumnarization, any query which >> touches a >> > >>>>>>> variant column must read, parse, extract, and filter every row >> for which >> > >>>>>>> that column is non-null. Thus, by providing a standardized way >> of tracking >> > >>>>>>> subcolumn metadata and data for variant columns, Iceberg can make >> > >>>>>>> subcolumnar optimizations accessible across various catalogs >> and query >> > >>>>>>> engines. >> > >>>>>>> > >> > >>>>>>> > Subcolumnarization is a non-trivial topic, so we expect any >> > >>>>>>> concrete proposal to include not only the set of changes to >> Iceberg >> > >>>>>>> metadata that allow compatible query engines to interoperate on >> > >>>>>>> subcolumnarization data for variant columns, but also reference >> > >>>>>>> documentation explaining subcolumnarization principles and >> recommended best >> > >>>>>>> practices. >> > >>>>>>> > >> > >>>>>>> > >> > >>>>>>> > It sounds like the recent Geo proposal [3] may be a good >> starting >> > >>>>>>> point for how to approach this, so our plan is to write >> something up in >> > >>>>>>> that vein that covers the proposed spec changes, backwards >> compatibility, >> > >>>>>>> implementor burdens, etc. But we wanted to first reach out to >> the community >> > >>>>>>> to introduce ourselves and the idea, and see if there’s any >> early feedback >> > >>>>>>> we should incorporate before we spend too much time on a >> concrete proposal. >> > >>>>>>> > >> > >>>>>>> > Thank you!
>> > >>>>>>> > >> > >>>>>>> > [1] >> > >>>>>>> >> https://docs.snowflake.com/en/sql-reference/data-types-semistructured >> > >>>>>>> > [2] >> > >>>>>>> >> https://github.com/apache/spark/blob/master/common/variant/README.md >> > >>>>>>> > [3] >> > >>>>>>> >> https://docs.google.com/document/d/1iVFbrRNEzZl8tDcZC81GFt01QJkLJsI9E2NBOt21IRI/edit >> > >>>>>>> > >> > >>>>>>> > -Tyler, Nileema, Selcuk, Aihua >> > >>>>>>> > >> > >>>>>>> >> > >>>>>> >> > >> > -- >> > Ryan Blue >> > Databricks >> > >> > -- Ryan Blue Databricks
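As a postscript for readers new to the subcolumnarization idea discussed in the thread above: once per-subcolumn pruning stats (min, max, null counts) are tracked in table metadata, an engine can skip whole files for predicates on nested fields without decoding any variant bytes. A toy sketch of that check; the field names and stat layout here are hypothetical, not from the proposal:

```python
# Toy illustration of subcolumnarization-driven pruning. Assumes
# (hypothetically) that table metadata tracks min/max/null-count stats
# per nested field of a variant column; none of these names come from
# the actual proposal.

def can_skip_file(subcolumn_stats, field_path, lower_bound):
    """Return True if the stats prove that no row in the file can
    satisfy `variant_col:field_path > lower_bound`."""
    stats = subcolumn_stats.get(field_path)
    if stats is None:
        # No stats tracked for this nested field: must read the file and
        # extract the value from the variant bytes row by row.
        return False
    if stats["null_count"] == stats["row_count"]:
        return True  # the field is missing/null in every row of this file
    return stats["max"] <= lower_bound


# Per-file stats an engine might consult before opening the file.
file_stats = {
    "event.ts": {"min": 100, "max": 200, "null_count": 0, "row_count": 50},
}
```

With stats like these, a filter such as `variant_col:event.ts > 300` lets the engine skip the file outright, while `> 150` still requires reading it; fields without tracked stats always fall back to a full scan.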