That's fair, I'm sold on an Iceberg module.

On Fri, Jul 12, 2024 at 2:53 PM Ryan Blue <b...@databricks.com.invalid> wrote:
> Feels like eventually the encoding should land in Parquet proper, right?

What about using it in ORC? I don't know where it should end up. Maybe Iceberg should make a standalone module from it?

On Fri, Jul 12, 2024 at 12:38 PM Russell Spitzer <russell.spit...@gmail.com> wrote:

Feels like eventually the encoding should land in Parquet proper, right? I'm fine with us just copying it into Iceberg for the time being.

On Fri, Jul 12, 2024 at 2:31 PM Ryan Blue <b...@databricks.com.invalid> wrote:

Oops, it looks like I missed where Aihua brought this up in his last email:

> do we have an issue to directly use Spark implementation in Iceberg?

Yes, I think that we do have an issue using the Spark library. What do you think about a Java implementation in Iceberg?

Ryan

On Fri, Jul 12, 2024 at 12:28 PM Ryan Blue <b...@databricks.com> wrote:

I raised the same point from Peter's email in a comment on the doc as well. There is a spark-variant_2.13 artifact that would be a much smaller scope than relying on large portions of Spark, but even then I doubt it is a good idea for Iceberg to depend on it, because it is a Scala artifact and we would need to bring in a ton of Scala libs. I think what makes the most sense is to have an independent implementation of the spec in Iceberg.

On Fri, Jul 12, 2024 at 11:51 AM Péter Váry <peter.vary.apa...@gmail.com> wrote:

Hi Aihua,
Long time no see :)
Would this mean that every engine which plans to support the Variant data type needs to add Spark as a dependency? Like Flink/Trino/Hive, etc.?
Thanks, Peter

On Fri, Jul 12, 2024, 19:10 Aihua Xu <aihu...@apache.org> wrote:

Thanks Ryan.

Yeah, that's another reason we want to pursue the Spark encoding: to keep compatibility across the open source engines.

One more question regarding the encoding implementation: do we have an issue with directly using the Spark implementation in Iceberg? Russell pointed out that Trino doesn't have a Spark dependency, and that could be a problem.

Thanks,
Aihua

On 2024/07/12 15:02:06 Ryan Blue wrote:

Thanks, Aihua!

I think that the encoding choice in the current doc is a good one. I went through the Spark encoding in detail and it looks like a better choice than the other candidate encodings for quickly accessing nested fields.

Another reason to use the Spark type is that this is what Delta's variant type is based on, so Parquet files in tables written by Delta could be converted or used in Iceberg tables without needing to rewrite variant data. (Also, note that I work at Databricks and have an interest in increasing format compatibility.)

Ryan
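For context on why the Spark layout is quick for nested field access: per the Spark variant README (linked as [2] later in the thread), field names are stored once in a metadata dictionary and an object's fields are ordered by key name, so each path step can be a binary search instead of a full parse. Below is a minimal Java sketch of that lookup under those assumptions; VariantObject and getFieldByKey are hypothetical names for illustration, not the actual spark-variant API, and the byte-level layout is elided.

    // Illustrative model of variant object lookup. Assumes fieldIds is sorted
    // by the key strings it references, as the Spark variant spec requires.
    final class VariantObject {
        private final String[] dictionary;   // metadata: de-duplicated key strings
        private final int[] fieldIds;        // per-field dictionary ids, sorted by key
        private final Object[] fieldValues;  // decoded field values, parallel to fieldIds

        VariantObject(String[] dictionary, int[] fieldIds, Object[] fieldValues) {
            this.dictionary = dictionary;
            this.fieldIds = fieldIds;
            this.fieldValues = fieldValues;
        }

        // Binary-search the sorted field ids for `key`: O(log n) per path step.
        Object getFieldByKey(String key) {
            int lo = 0, hi = fieldIds.length - 1;
            while (lo <= hi) {
                int mid = (lo + hi) >>> 1;
                int cmp = dictionary[fieldIds[mid]].compareTo(key);
                if (cmp == 0) return fieldValues[mid];
                if (cmp < 0) lo = mid + 1; else hi = mid - 1;
            }
            return null; // absent key; variant fields are dynamically typed
        }

        public static void main(String[] args) {
            String[] dict = {"city", "name", "user"};
            // Object with fields "name" and "user", sorted by key name.
            VariantObject obj = new VariantObject(dict, new int[] {1, 2},
                new Object[] {"a", "..."});
            System.out.println(obj.getFieldByKey("name")); // prints: a
        }
    }

A chained path like $.user.address.city then costs one such search per level, without deserializing the rest of the document.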
On Thu, Jul 11, 2024 at 11:21 AM Aihua Xu <aihua...@snowflake.com.invalid> wrote:

[Discuss] Consensus for Variant Encoding

It was great to be able to present the Variant type proposal in the community sync yesterday, and I'm looking to host a meeting next week (targeting 9am, July 17th) to go over any further concerns about the encoding of the Variant type and any other questions on the first phase of the proposal <https://docs.google.com/document/d/1QjhpG_SVNPZh3anFcpicMQx90ebwjL7rmzFYfUP89Iw/edit>. We are hoping that anyone who is interested in the proposal can either join or reply with their comments so we can discuss them. A summary of the discussion and notes will be sent to the mailing list for further comment there.

- What should be the underlying binary representation?

We have evaluated a few encodings in the doc, including Ion, JSONB, and the Spark encoding. Choosing the underlying encoding is an important first step, and we believe we have general support for Spark's Variant encoding. We would like to hear if anyone else has strong opinions in this space.

- Should we support multiple logical types or just Variant? Variant vs. Variant + JSON.

This is to discuss which logical data type(s) should be supported in Iceberg: Variant only, or Variant + JSON. Both types would share the same underlying encoding but would imply different limitations on engines working with those types (a sketch of the distinction follows below).

From the sync-up meeting, we are leaning toward supporting Variant only, and we want to reach consensus on the supported type(s).

- How should we move forward with subcolumnarization?

Subcolumnarization is an optimization for the Variant type that separates out subcolumns with their own metadata. It is not critical for choosing the initial encoding of the Variant type, so we were hoping to gain consensus on leaving it for a follow-up spec.

Thanks

Aihua

Meeting invite:

Wednesday, July 17 · 9:00 – 10:00am
Time zone: America/Los_Angeles
Google Meet joining info
Video call link: https://meet.google.com/pbm-ovzn-aoq
Or dial: (US) +1 650-449-9343 PIN: 170 576 525#
More phone numbers: https://tel.meet/pbm-ovzn-aoq?pin=4079632691790
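On the "Variant vs. Variant + JSON" question, one way to see the trade-off: both logical types would sit on the same bytes, so a JSON type is largely a rendering (and round-trip) contract over the decoded value tree. A hedged Java sketch of that rendering, using a plain Map/List model of a decoded variant; this is illustration only, not a proposed API:

    // Renders a decoded variant-like tree as JSON text. Escaping is simplified
    // (quotes only) since this is a sketch of the idea, not a JSON library.
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.stream.Collectors;

    final class VariantToJson {
        static String toJson(Object v) {
            if (v == null) return "null";
            if (v instanceof String s) return "\"" + s.replace("\"", "\\\"") + "\"";
            if (v instanceof Map<?, ?> m) {
                return m.entrySet().stream()
                    .map(e -> "\"" + e.getKey() + "\":" + toJson(e.getValue()))
                    .collect(Collectors.joining(",", "{", "}"));
            }
            if (v instanceof List<?> l) {
                return l.stream().map(VariantToJson::toJson)
                    .collect(Collectors.joining(",", "[", "]"));
            }
            return v.toString(); // numbers and booleans
        }

        public static void main(String[] args) {
            Map<String, Object> m = new LinkedHashMap<>();
            m.put("a", List.of(1, 2));
            m.put("b", "x");
            System.out.println(toJson(m)); // prints: {"a":[1,2],"b":"x"}
        }
    }

A Variant-only type leaves values fully typed, while a JSON type would additionally pin down a textual projection like the one above, which is part of what would constrain engines differently.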
On Tue, May 28, 2024 at 9:21 PM Aihua Xu <aihua...@snowflake.com> wrote:

Hello,

We have drafted the proposal <https://docs.google.com/document/d/1QjhpG_SVNPZh3anFcpicMQx90ebwjL7rmzFYfUP89Iw/edit> for the Variant data type. Please help review and comment.

Thanks,
Aihua

On Thu, May 16, 2024 at 12:45 PM Jack Ye <yezhao...@gmail.com> wrote:

+10000 for a JSON/BSON type. We also had the same discussion internally, and a JSON type would play really well with, for example, the SUPER type in Redshift (https://docs.aws.amazon.com/redshift/latest/dg/r_SUPER_type.html), and could also provide better integration with the Trino JSON type.

Looking forward to the proposal!

Best,
Jack Ye

On Wed, May 15, 2024 at 9:37 AM Tyler Akidau <tyler.aki...@snowflake.com.invalid> wrote:

On Tue, May 14, 2024 at 7:58 PM Gang Wu <ust...@gmail.com> wrote:

>> We may need some guidance on just how many we need to look at;
>> we were planning on Spark and Trino, but weren't sure how much
>> further down the rabbit hole we needed to go.
>
> There are some engines living outside the Java world. It would be good
> if the proposal could cover the effort it takes to integrate the variant
> type into them (e.g. Velox, DataFusion, etc.). This is something that
> some proprietary Iceberg vendors also care about.

Ack, makes sense. We can make sure to share some perspective on this.

>> Not necessarily, no. As long as there's a binary type, and Iceberg and
>> the query engines are aware that the binary column needs to be
>> interpreted as a variant, that should be sufficient.
>
> From the perspective of interoperability, it would be good for the file
> specs to support the type natively. Life will be easier for projects like
> Apache XTable. The file format could also provide finer-grained statistics
> for the variant type, which facilitates data skipping.

Agreed, there can definitely be additional value in native file format integration. Just wanted to highlight that it's not a strict requirement.

-Tyler
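Gang's data-skipping point is worth making concrete: if the file format or table metadata tracks min/max statistics for a variant subfield, a planner can drop whole files before reading them. A small hedged Java sketch of that pruning decision; FileStats, the paths, and the values are made up for illustration:

    // Keep only the files whose [min, max] range for a tracked subfield
    // (say payload.user.age) could contain the predicate value.
    import java.util.List;

    record FileStats(String path, long min, long max) {}

    final class SubcolumnPruning {
        static List<FileStats> filesMatchingEquals(List<FileStats> files, long value) {
            return files.stream()
                .filter(f -> f.min() <= value && value <= f.max())
                .toList();
        }

        public static void main(String[] args) {
            List<FileStats> files = List.of(
                new FileStats("data/00000.parquet", 18, 34),
                new FileStats("data/00001.parquet", 35, 70));
            // A predicate like payload.user.age = 25 prunes the second file.
            System.out.println(filesMatchingEquals(files, 25));
        }
    }

Without such per-subfield statistics, every file containing the variant column has to be read and parsed to evaluate the predicate.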
On Wed, May 15, 2024 at 6:49 AM Tyler Akidau <tyler.aki...@snowflake.com.invalid> wrote:

Good to see you again as well, JB! Thanks!

-Tyler

On Tue, May 14, 2024 at 1:04 PM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:

Hi Tyler,

Super happy to see you there :) It reminds me of our discussions back at the start of Apache Beam :)

Anyway, the thread is pretty interesting. I remember some discussions about a JSON datatype for spec v3. The binary data type is already supported in spec v2.

I'm looking forward to the proposal and happy to help on this!

Regards
JB

On Sat, May 11, 2024 at 7:06 AM Tyler Akidau <tyler.aki...@snowflake.com.invalid> wrote:

Hello,

We (Tyler, Nileema, Selcuk, Aihua) are working on a proposal for which we'd like to get early feedback from the community. As you may know, Snowflake has embraced Iceberg as its open data lake format. Having made good progress on our own adoption of the Iceberg standard, we're now in a position where there are features not yet supported in Iceberg which we think would be valuable for our users, and that we would like to discuss with and help contribute to the Iceberg community.

The first two such features we'd like to discuss are in support of efficient querying of dynamically typed, semi-structured data: variant data types, and subcolumnarization of variant columns. In more detail, for anyone who may not already be familiar:

1. Variant data types

Variant types allow for the efficient binary encoding of dynamic semi-structured data such as JSON, Avro, etc. By encoding semi-structured data as a variant column, we retain the flexibility of the source data while allowing query engines to operate on the data more efficiently. Snowflake has supported the variant data type on Snowflake tables for many years [1]. As more and more users utilize Iceberg tables in Snowflake, we're hearing an increasing chorus of requests for variant support. Additionally, other query engines such as Apache Spark have begun adding variant support [2]. As such, we believe it would be beneficial to the Iceberg community as a whole to standardize on the variant data type encoding used across Iceberg tables.

One specific point to make here is that, since an Apache OSS version of the variant encoding already exists in Spark, it likely makes sense to simply adopt the Spark encoding as the Iceberg standard as well. The encoding we use internally today in Snowflake is slightly different, but essentially equivalent, and we see no particular value in trying to clutter the space with another equivalent-but-incompatible encoding.

2. Subcolumnarization

Subcolumnarization of variant columns allows query engines to efficiently prune datasets when subcolumns (i.e., nested fields) within a variant column are queried, and also allows optionally materializing some of the nested fields as columns of their own, affording queries on these subcolumns the ability to read less data and spend less CPU on extraction. When subcolumnarizing, the system managing table metadata and data tracks individual pruning statistics (min, max, null, etc.) for some subset of the nested fields within a variant, and also manages any optional materialization. Without subcolumnarization, any query which touches a variant column must read, parse, extract, and filter every row for which that column is non-null. Thus, by providing a standardized way of tracking subcolumn metadata and data for variant columns, Iceberg can make subcolumnar optimizations accessible across various catalogs and query engines.
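A hedged sketch of the optional materialization half of this idea: at write time, a chosen nested path is extracted into its own typed column alongside the variant, so queries on that path can avoid touching the variant bytes at all. Everything here (the Map-based variant model, the event.ts path, Row) is hypothetical illustration, not proposed spec:

    // Shreds one nested path out of a Map-modeled variant into a typed column.
    import java.util.Map;
    import java.util.Optional;

    record Row(Map<String, Object> variant, Optional<Long> eventTs) {}

    final class Shredder {
        // Extracts variant[outer][inner] if present and a long; empty otherwise,
        // since any given row may lack the path or hold a different type there.
        static Optional<Long> extractLong(Map<String, Object> variant,
                                          String outer, String inner) {
            Object nested = variant.get(outer);
            if (nested instanceof Map<?, ?> m && m.get(inner) instanceof Long l) {
                return Optional.of(l);
            }
            return Optional.empty();
        }

        public static void main(String[] args) {
            Map<String, Object> v =
                Map.of("event", Map.of("ts", 1720800000L, "kind", "click"));
            Row row = new Row(v, extractLong(v, "event", "ts"));
            System.out.println(row.eventTs()); // prints: Optional[1720800000]
        }
    }

Readers that need event.ts would scan the shredded column and its statistics; everything else falls back to the full variant value.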
Subcolumnarization is a non-trivial topic, so we expect any concrete proposal to include not only the set of changes to Iceberg metadata that allow compatible query engines to interoperate on subcolumnarization data for variant columns, but also reference documentation explaining subcolumnarization principles and recommended best practices.

It sounds like the recent Geo proposal [3] may be a good starting point for how to approach this, so our plan is to write something up in that vein that covers the proposed spec changes, backwards compatibility, implementor burdens, etc. But we wanted to first reach out to the community to introduce ourselves and the idea, and see if there's any early feedback we should incorporate before we spend too much time on a concrete proposal.

Thank you!

[1] https://docs.snowflake.com/en/sql-reference/data-types-semistructured
[2] https://github.com/apache/spark/blob/master/common/variant/README.md
[3] https://docs.google.com/document/d/1iVFbrRNEzZl8tDcZC81GFt01QJkLJsI9E2NBOt21IRI/edit

-Tyler, Nileema, Selcuk, Aihua