Oops, it looks like I missed where Aihua brought this up in his last email:
> do we have an issue to directly use Spark implementation in Iceberg?

Yes, I think that we do have an issue using the Spark library. What do you think about a Java implementation in Iceberg?

Ryan

On Fri, Jul 12, 2024 at 12:28 PM Ryan Blue <b...@databricks.com> wrote:

> I raised the same point from Peter's email in a comment on the doc as well. There is a spark-variant_2.13 artifact that would be a much smaller scope than relying on large portions of Spark, but even then I doubt that it is a good idea for Iceberg to depend on that because it is a Scala artifact and we would need to bring in a ton of Scala libs. I think what makes the most sense is to have an independent implementation of the spec in Iceberg.
>
> On Fri, Jul 12, 2024 at 11:51 AM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>
>> Hi Aihua,
>> Long time no see :)
>> Would this mean that every engine which plans to support the Variant data type needs to add Spark as a dependency? Like Flink/Trino/Hive etc.?
>> Thanks, Peter
>>
>> On Fri, Jul 12, 2024, 19:10 Aihua Xu <aihu...@apache.org> wrote:
>>
>>> Thanks Ryan.
>>>
>>> Yeah. That's another reason we want to pursue the Spark encoding: to keep compatibility for the open source engines.
>>>
>>> One more question regarding the encoding implementation: do we have an issue to directly use the Spark implementation in Iceberg? Russell pointed out that Trino doesn't have a Spark dependency and that could be a problem?
>>>
>>> Thanks,
>>> Aihua
>>>
>>> On 2024/07/12 15:02:06 Ryan Blue wrote:
>>> > Thanks, Aihua!
>>> >
>>> > I think that the encoding choice in the current doc is a good one. I went through the Spark encoding in detail and it looks like a better choice than the other candidate encodings for quickly accessing nested fields.
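Ryan's point about quickly accessing nested fields comes from the encoding's split into a sorted key dictionary plus a value tree that refers to keys by id. The sketch below is only a rough Python illustration of that general idea, not the actual Spark byte layout (which is a compact binary format described in the Spark variant README); all function names here are invented for the sketch.

```python
from bisect import bisect_left

def collect_keys(obj, acc):
    # Gather every object key in the tree, at any depth.
    if isinstance(obj, dict):
        for k, v in obj.items():
            acc.add(k)
            collect_keys(v, acc)
    elif isinstance(obj, list):
        for v in obj:
            collect_keys(v, acc)

def encode(obj):
    """Split a JSON-like value into (metadata, value): metadata is a sorted
    key dictionary, and the value tree refers to keys by integer id."""
    acc = set()
    collect_keys(obj, acc)
    metadata = sorted(acc)
    ids = {k: i for i, k in enumerate(metadata)}
    def enc(o):
        if isinstance(o, dict):
            # Fields are stored sorted by key id so a reader can binary-search.
            return ("obj", sorted((ids[k], enc(v)) for k, v in o.items()))
        if isinstance(o, list):
            return ("arr", [enc(v) for v in o])
        return ("lit", o)
    return metadata, enc(obj)

def get_field(metadata, value, name):
    """Fetch one top-level field without walking sibling subtrees."""
    i = bisect_left(metadata, name)
    if i == len(metadata) or metadata[i] != name:
        return None  # key not present anywhere in this variant
    tag, fields = value
    if tag != "obj":
        return None
    lo, hi = 0, len(fields)
    while lo < hi:  # binary search on the id-sorted field list
        mid = (lo + hi) // 2
        if fields[mid][0] < i:
            lo = mid + 1
        else:
            hi = mid
    if lo < len(fields) and fields[lo][0] == i:
        return fields[lo][1]
    return None
```

The key property being illustrated: `get_field` finds a field with two binary searches and never descends into sibling subtrees, which is why a dictionary-based encoding beats re-parsing text for nested access.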
>>> >
>>> > Another reason to use the Spark type is that this is what Delta's variant type is based on, so Parquet files in tables written by Delta could be converted or used in Iceberg tables without needing to rewrite variant data. (Also, note that I work at Databricks and have an interest in increasing format compatibility.)
>>> >
>>> > Ryan
>>> >
>>> > On Thu, Jul 11, 2024 at 11:21 AM Aihua Xu <aihua...@snowflake.com.invalid> wrote:
>>> >
>>> > > [Discuss] Consensus for Variant Encoding
>>> > >
>>> > > It’s great to be able to present the Variant type proposal in the community sync yesterday, and I’m looking to host a meeting next week (targeting 9am, July 17th) to go over any further concerns about the encoding of the Variant type and any other questions on the first phase of the proposal <https://docs.google.com/document/d/1QjhpG_SVNPZh3anFcpicMQx90ebwjL7rmzFYfUP89Iw/edit>. We are hoping that anyone who is interested in the proposal can either join or reply with their comments so we can discuss them. A summary of the discussion and notes will be sent to the mailing list for further comment there.
>>> > >
>>> > > - What should be the underlying binary representation?
>>> > >
>>> > > We have evaluated a few encodings in the doc, including ION, JSONB, and the Spark encoding. Choosing the underlying encoding is an important first step here, and we believe we have general support for Spark’s Variant encoding. We would like to hear if anyone else has strong opinions in this space.
>>> > >
>>> > > - Should we support multiple logical types or just Variant? Variant vs. Variant + JSON.
>>> > >
>>> > > This is to discuss what logical data type(s) should be supported in Iceberg - Variant only vs. Variant + JSON.
>>> > > Both types would share the same underlying encoding but would imply different limitations on engines working with those types.
>>> > >
>>> > > From the sync-up meeting, we are leaning toward supporting Variant only, and we want to reach a consensus on the supported type(s).
>>> > >
>>> > > - How should we move forward with subcolumnarization?
>>> > >
>>> > > Subcolumnarization is an optimization for the Variant type that separates out subcolumns with their own metadata. This is not critical for choosing the initial encoding of the Variant type, so we were hoping to gain consensus on leaving that for a follow-up spec.
>>> > >
>>> > > Thanks,
>>> > >
>>> > > Aihua
>>> > >
>>> > > Meeting invite:
>>> > >
>>> > > Wednesday, July 17 · 9:00 – 10:00am
>>> > > Time zone: America/Los_Angeles
>>> > > Google Meet joining info
>>> > > Video call link: https://meet.google.com/pbm-ovzn-aoq
>>> > > Or dial: (US) +1 650-449-9343 PIN: 170 576 525#
>>> > > More phone numbers: https://tel.meet/pbm-ovzn-aoq?pin=4079632691790
>>> > >
>>> > > On Tue, May 28, 2024 at 9:21 PM Aihua Xu <aihua...@snowflake.com> wrote:
>>> > >
>>> > >> Hello,
>>> > >>
>>> > >> We have drafted the proposal <https://docs.google.com/document/d/1QjhpG_SVNPZh3anFcpicMQx90ebwjL7rmzFYfUP89Iw/edit> for the Variant data type. Please help review and comment.
>>> > >>
>>> > >> Thanks,
>>> > >> Aihua
>>> > >>
>>> > >> On Thu, May 16, 2024 at 12:45 PM Jack Ye <yezhao...@gmail.com> wrote:
>>> > >>
>>> > >>> +10000 for a JSON/BSON type. We also had the same discussion internally, and a JSON type would really play well with, for example, the SUPER type in Redshift: https://docs.aws.amazon.com/redshift/latest/dg/r_SUPER_type.html, and can also provide better integration with the Trino JSON type.
>>> > >>>
>>> > >>> Looking forward to the proposal!
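The per-subcolumn pruning statistics mentioned in the subcolumnarization bullet can be pictured with a small Python sketch. The stat names and layout below are assumptions made for illustration only, not the metadata format the proposal would define.

```python
def subcolumn_stats(rows, path):
    """Collect min/max/null-count for one nested field ("subcolumn") across
    the rows of a data file. Illustrative only: the real proposal would
    define the actual statistics shape."""
    values, nulls = [], 0
    for row in rows:
        v = row
        for part in path.split("."):
            v = v.get(part) if isinstance(v, dict) else None
            if v is None:
                break
        if v is None:
            nulls += 1
        else:
            values.append(v)
    return {
        "nulls": nulls,
        "min": min(values) if values else None,
        "max": max(values) if values else None,
    }

def can_skip_file(stats, lower_bound):
    # A query filtering `subcolumn > lower_bound` can skip the whole file
    # when every recorded value falls at or below the bound.
    return stats["max"] is not None and stats["max"] <= lower_bound
```

With stats like these tracked per data file, an engine can prune files for a predicate on a nested field without opening and parsing any variant values, which is the core of the optimization being discussed.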
>>> > >>>
>>> > >>> Best,
>>> > >>> Jack Ye
>>> > >>>
>>> > >>> On Wed, May 15, 2024 at 9:37 AM Tyler Akidau <tyler.aki...@snowflake.com.invalid> wrote:
>>> > >>>
>>> > >>>> On Tue, May 14, 2024 at 7:58 PM Gang Wu <ust...@gmail.com> wrote:
>>> > >>>>
>>> > >>>>> > We may need some guidance on just how many we need to look at;
>>> > >>>>> > we were planning on Spark and Trino, but weren't sure how much
>>> > >>>>> > further down the rabbit hole we needed to go.
>>> > >>>>>
>>> > >>>>> There are some engines living outside the Java world. It would be good if the proposal could cover the effort it takes to integrate the variant type into them (e.g. Velox, DataFusion, etc.). This is something that some proprietary Iceberg vendors also care about.
>>> > >>>>>
>>> > >>>> Ack, makes sense. We can make sure to share some perspective on this.
>>> > >>>>
>>> > >>>>> > Not necessarily, no. As long as there's a binary type and Iceberg and
>>> > >>>>> > the query engines are aware that the binary column needs to be
>>> > >>>>> > interpreted as a variant, that should be sufficient.
>>> > >>>>>
>>> > >>>>> From the perspective of interoperability, it would be good to support a native type in the file specs. Life will be easier for projects like Apache XTable. The file format could also provide finer-grained statistics for the variant type, which facilitates data skipping.
>>> > >>>>>
>>> > >>>> Agreed, there can definitely be additional value in native file format integration. Just wanted to highlight that it's not a strict requirement.
>>> > >>>>
>>> > >>>> -Tyler
>>> > >>>>
>>> > >>>>> Gang
>>> > >>>>>
>>> > >>>>> On Wed, May 15, 2024 at 6:49 AM Tyler Akidau <tyler.aki...@snowflake.com.invalid> wrote:
>>> > >>>>>
>>> > >>>>>> Good to see you again as well, JB!
>>> > >>>>>> Thanks!
>>> > >>>>>>
>>> > >>>>>> -Tyler
>>> > >>>>>>
>>> > >>>>>> On Tue, May 14, 2024 at 1:04 PM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>>> > >>>>>>
>>> > >>>>>>> Hi Tyler,
>>> > >>>>>>>
>>> > >>>>>>> Super happy to see you there :) It reminds me of our discussions back at the start of Apache Beam :)
>>> > >>>>>>>
>>> > >>>>>>> Anyway, the thread is pretty interesting. I remember some discussions about a JSON data type for spec v3. The binary data type is already supported in spec v2.
>>> > >>>>>>>
>>> > >>>>>>> I'm looking forward to the proposal and happy to help on this!
>>> > >>>>>>>
>>> > >>>>>>> Regards,
>>> > >>>>>>> JB
>>> > >>>>>>>
>>> > >>>>>>> On Sat, May 11, 2024 at 7:06 AM Tyler Akidau <tyler.aki...@snowflake.com.invalid> wrote:
>>> > >>>>>>> >
>>> > >>>>>>> > Hello,
>>> > >>>>>>> >
>>> > >>>>>>> > We (Tyler, Nileema, Selcuk, Aihua) are working on a proposal for which we’d like to get early feedback from the community. As you may know, Snowflake has embraced Iceberg as its open Data Lake format. Having made good progress on our own adoption of the Iceberg standard, we’re now in a position where there are features not yet supported in Iceberg which we think would be valuable for our users, and that we would like to discuss with and help contribute to the Iceberg community.
>>> > >>>>>>> >
>>> > >>>>>>> > The first two such features we’d like to discuss are in support of efficient querying of dynamically typed, semi-structured data: variant data types, and subcolumnarization of variant columns. In more detail, for anyone who may not already be familiar:
>>> > >>>>>>> >
>>> > >>>>>>> > 1.
>>> > >>>>>>> > Variant data types
>>> > >>>>>>> > Variant types allow for the efficient binary encoding of dynamic semi-structured data such as JSON, Avro, etc. By encoding semi-structured data as a variant column, we retain the flexibility of the source data, while allowing query engines to more efficiently operate on the data. Snowflake has supported the variant data type on Snowflake tables for many years [1]. As more and more users utilize Iceberg tables in Snowflake, we’re hearing an increasing chorus of requests for variant support. Additionally, other query engines such as Apache Spark have begun adding variant support [2]. As such, we believe it would be beneficial to the Iceberg community as a whole to standardize on the variant data type encoding used across Iceberg tables.
>>> > >>>>>>> >
>>> > >>>>>>> > One specific point to make here is that, since an Apache OSS version of variant encoding already exists in Spark, it likely makes sense to simply adopt the Spark encoding as the Iceberg standard as well. The encoding we use internally today in Snowflake is slightly different, but essentially equivalent, and we see no particular value in trying to clutter the space with another equivalent-but-incompatible encoding.
>>> > >>>>>>> >
>>> > >>>>>>> > 2.
>>> > >>>>>>> > Subcolumnarization
>>> > >>>>>>> > Subcolumnarization of variant columns allows query engines to efficiently prune datasets when subcolumns (i.e., nested fields) within a variant column are queried, and also allows optionally materializing some of the nested fields as columns of their own, affording queries on these subcolumns the ability to read less data and spend less CPU on extraction. When subcolumnarizing, the system managing table metadata and data tracks individual pruning statistics (min, max, null, etc.) for some subset of the nested fields within a variant, and also manages any optional materialization. Without subcolumnarization, any query which touches a variant column must read, parse, extract, and filter every row for which that column is non-null. Thus, by providing a standardized way of tracking subcolumn metadata and data for variant columns, Iceberg can make subcolumnar optimizations accessible across various catalogs and query engines.
>>> > >>>>>>> >
>>> > >>>>>>> > Subcolumnarization is a non-trivial topic, so we expect any concrete proposal to include not only the set of changes to Iceberg metadata that allow compatible query engines to interoperate on subcolumnarization data for variant columns, but also reference documentation explaining subcolumnarization principles and recommended best practices.
>>> > >>>>>>> >
>>> > >>>>>>> > It sounds like the recent Geo proposal [3] may be a good starting point for how to approach this, so our plan is to write something up in that vein that covers the proposed spec changes, backwards compatibility, implementor burdens, etc.
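The optional materialization described above can be sketched briefly: selected nested fields are extracted once at write time into standalone columns, so queries on them avoid reading and parsing the variant at all. This is a hypothetical Python illustration; the function and its behavior are not part of the proposal.

```python
def materialize_subcolumns(rows, paths):
    """Extract chosen nested fields ("a.b" style dotted paths) into
    standalone columns at write time. Hypothetical sketch only; the
    proposal would define the real materialization mechanism."""
    columns = {p: [] for p in paths}
    for row in rows:
        for p in paths:
            v = row
            for part in p.split("."):
                v = v.get(part) if isinstance(v, dict) else None
                if v is None:
                    break
            # Missing fields become nulls in the materialized column.
            columns[p].append(v)
    return columns
```

Once materialized, a query touching only `user.id` reads that column directly, which is the "read less data, spend less CPU on extraction" benefit described in the paragraph above.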
>>> > >>>>>>> > But we wanted to first reach out to the community to introduce ourselves and the idea, and see if there’s any early feedback we should incorporate before we spend too much time on a concrete proposal.
>>> > >>>>>>> >
>>> > >>>>>>> > Thank you!
>>> > >>>>>>> >
>>> > >>>>>>> > [1] https://docs.snowflake.com/en/sql-reference/data-types-semistructured
>>> > >>>>>>> > [2] https://github.com/apache/spark/blob/master/common/variant/README.md
>>> > >>>>>>> > [3] https://docs.google.com/document/d/1iVFbrRNEzZl8tDcZC81GFt01QJkLJsI9E2NBOt21IRI/edit
>>> > >>>>>>> >
>>> > >>>>>>> > -Tyler, Nileema, Selcuk, Aihua
>>> >
>>> > --
>>> > Ryan Blue
>>> > Databricks
>
> --
> Ryan Blue
> Databricks

--
Ryan Blue
Databricks