I raised the same point from Peter's email in a comment on the doc as well. There is a spark-variant_2.13 artifact that would be a much smaller scope than relying on large portions of Spark, but even then I doubt that it is a good idea for Iceberg to depend on that, because it is a Scala artifact and we would need to bring in a ton of Scala libs. I think what makes the most sense is to have an independent implementation of the spec in Iceberg.
On Fri, Jul 12, 2024 at 11:51 AM Péter Váry <peter.vary.apa...@gmail.com> wrote: > Hi Aihua, > Long time no see :) > Would this mean that every engine which plans to support the Variant data > type needs to add Spark as a dependency? Like Flink/Trino/Hive etc? > Thanks, Peter > > > On Fri, Jul 12, 2024, 19:10 Aihua Xu <aihu...@apache.org> wrote: > >> Thanks Ryan. >> >> Yeah. That's another reason we want to pursue Spark encoding to keep >> compatibility with the open source engines. >> >> One more question regarding the encoding implementation: is there an >> issue with directly using the Spark implementation in Iceberg? Russell pointed out >> that Trino doesn't have a Spark dependency and that could be a problem. >> >> Thanks, >> Aihua >> >> On 2024/07/12 15:02:06 Ryan Blue wrote: >> > Thanks, Aihua! >> > >> > I think that the encoding choice in the current doc is a good one. I >> went >> > through the Spark encoding in detail and it looks like a better choice >> than >> > the other candidate encodings for quickly accessing nested fields. >> > >> > Another reason to use the Spark type is that this is what Delta's >> variant >> > type is based on, so Parquet files in tables written by Delta could be >> > converted or used in Iceberg tables without needing to rewrite variant >> > data. (Also, note that I work at Databricks and have an interest in >> > increasing format compatibility.)
>> > >> > Ryan >> > >> > On Thu, Jul 11, 2024 at 11:21 AM Aihua Xu <aihua...@snowflake.com >> .invalid> >> > wrote: >> > >> > > [Discuss] Consensus for Variant Encoding >> > > >> > > It’s great to be able to present the Variant type proposal in the >> > > community sync yesterday and I’m looking to host a meeting next week >> > > (targeting 9am, July 17th) to go over any further concerns about >> the >> > > encoding of the Variant type and any other questions on the first >> phase of >> > > the proposal >> > > < >> https://docs.google.com/document/d/1QjhpG_SVNPZh3anFcpicMQx90ebwjL7rmzFYfUP89Iw/edit >> >. >> > > We are hoping that anyone who is interested in the proposal can >> either join >> > > or reply with their comments so we can discuss them. A summary of the >> > > discussion and notes will be sent to the mailing list for further >> comment >> > > there. >> > > >> > > >> > > - >> > > >> > > What should be the underlying binary representation? >> > > >> > > We have evaluated a few encodings in the doc, including ION, JSONB, and >> > > the Spark encoding. Choosing the underlying encoding is an important first >> step >> > > here and we believe we have general support for Spark’s Variant >> encoding. >> > > We would like to hear if anyone else has strong opinions in this >> space. >> > > >> > > >> > > - >> > > >> > > Should we support multiple logical types or just Variant? Variant >> vs. >> > > Variant + JSON. >> > > >> > > This is to discuss what logical data type(s) should be supported in >> Iceberg - >> > > Variant only vs. Variant + JSON. Both types would share the same >> underlying >> > > encoding but would imply different limitations on engines working with >> > > those types. >> > > >> > > From the sync-up meeting, we are leaning toward supporting >> Variant >> > > only and we want to have a consensus on the supported type(s). >> > > >> > > >> > > - >> > > >> > > How should we move forward with Subcolumnarization?
>> > > >> > > Subcolumnarization is an optimization for the Variant type that separates out >> > > subcolumns with their own metadata. This is not critical for choosing >> the >> > > initial encoding of the Variant type, so we were hoping to gain >> consensus on >> > > leaving that for a follow-up spec. >> > > >> > > >> > > Thanks >> > > >> > > Aihua >> > > >> > > Meeting invite: >> > > >> > > Wednesday, July 17 · 9:00 – 10:00am >> > > Time zone: America/Los_Angeles >> > > Google Meet joining info >> > > Video call link: https://meet.google.com/pbm-ovzn-aoq >> > > Or dial: (US) +1 650-449-9343 PIN: 170 576 525# >> > > More phone numbers: https://tel.meet/pbm-ovzn-aoq?pin=4079632691790 >> > > >> > > On Tue, May 28, 2024 at 9:21 PM Aihua Xu <aihua...@snowflake.com> >> wrote: >> > > >> > >> Hello, >> > >> >> > >> We have drafted the proposal >> > >> < >> https://docs.google.com/document/d/1QjhpG_SVNPZh3anFcpicMQx90ebwjL7rmzFYfUP89Iw/edit >> > >> > >> for the Variant data type. Please help review and comment. >> > >> >> > >> Thanks, >> > >> Aihua >> > >> >> > >> On Thu, May 16, 2024 at 12:45 PM Jack Ye <yezhao...@gmail.com> >> wrote: >> > >> >> > >>> +10000 for a JSON/BSON type. We also had the same discussion >> internally >> > >>> and a JSON type would really play well with, for example, the SUPER >> type in >> > >>> Redshift: >> > >>> https://docs.aws.amazon.com/redshift/latest/dg/r_SUPER_type.html, >> and >> > >>> can also provide better integration with the Trino JSON type. >> > >>> >> > >>> Looking forward to the proposal!
>> > >>> >> > >>> Best, >> > >>> Jack Ye >> > >>> >> > >>> >> > >>> On Wed, May 15, 2024 at 9:37 AM Tyler Akidau >> > >>> <tyler.aki...@snowflake.com.invalid> wrote: >> > >>> >> > >>>> On Tue, May 14, 2024 at 7:58 PM Gang Wu <ust...@gmail.com> wrote: >> > >>>> >> > >>>>> > We may need some guidance on just how many we need to look at; >> > >>>>> > we were planning on Spark and Trino, but weren't sure how much >> > >>>>> > further down the rabbit hole we needed to go. >> > >>>>> >> > >>>>> There are some engines living outside the Java world. It would be >> > >>>>> good if the proposal could cover the effort it takes to integrate the >> > >>>>> variant type into them (e.g. velox, datafusion, etc.). This is >> something >> > >>>>> that >> > >>>>> some proprietary Iceberg vendors also care about. >> > >>>>> >> > >>>> >> > >>>> Ack, makes sense. We can make sure to share some perspective on >> this. >> > >>>> >> > >>>> > Not necessarily, no. As long as there's a binary type and >> Iceberg and >> > >>>>> > the query engines are aware that the binary column needs to be >> > >>>>> > interpreted as a variant, that should be sufficient. >> > >>>>> >> > >>>>> From the perspective of interoperability, it would be good to >> support a >> > >>>>> native >> > >>>>> type in the file specs. Life will be easier for projects like Apache >> > >>>>> XTable. >> > >>>>> File formats could also provide finer-grained statistics for the >> variant >> > >>>>> type, which >> > >>>>> facilitates data skipping. >> > >>>>> >> > >>>> >> > >>>> Agreed, there can definitely be additional value in native file >> format >> > >>>> integration. Just wanted to highlight that it's not a strict >> requirement. >> > >>>> >> > >>>> -Tyler >> > >>>> >> > >>>> >> > >>>>> >> > >>>>> Gang >> > >>>>> >> > >>>>> On Wed, May 15, 2024 at 6:49 AM Tyler Akidau >> > >>>>> <tyler.aki...@snowflake.com.invalid> wrote: >> > >>>>> >> > >>>>>> Good to see you again as well, JB! Thanks!
>> > >>>>>> >> > >>>>>> -Tyler >> > >>>>>> >> > >>>>>> >> > >>>>>> On Tue, May 14, 2024 at 1:04 PM Jean-Baptiste Onofré < >> j...@nanthrax.net> >> > >>>>>> wrote: >> > >>>>>> >> > >>>>>>> Hi Tyler, >> > >>>>>>> >> > >>>>>>> Super happy to see you there :) It reminds me of our discussions >> back at >> > >>>>>>> the start of Apache Beam :) >> > >>>>>>> >> > >>>>>>> Anyway, the thread is pretty interesting. I remember some >> discussions >> > >>>>>>> about a JSON datatype for spec v3. The binary data type is already >> > >>>>>>> supported in spec v2. >> > >>>>>>> >> > >>>>>>> I'm looking forward to the proposal and happy to help on this! >> > >>>>>>> >> > >>>>>>> Regards >> > >>>>>>> JB >> > >>>>>>> >> > >>>>>>> On Sat, May 11, 2024 at 7:06 AM Tyler Akidau >> > >>>>>>> <tyler.aki...@snowflake.com.invalid> wrote: >> > >>>>>>> > >> > >>>>>>> > Hello, >> > >>>>>>> > >> > >>>>>>> > We (Tyler, Nileema, Selcuk, Aihua) are working on a proposal >> for >> > >>>>>>> which we’d like to get early feedback from the community. As >> you may know, >> > >>>>>>> Snowflake has embraced Iceberg as its open Data Lake format. >> Having made >> > >>>>>>> good progress on our own adoption of the Iceberg standard, >> we’re now in a >> > >>>>>>> position where there are features not yet supported in Iceberg >> which we >> > >>>>>>> think would be valuable for our users, and that we would like >> to discuss >> > >>>>>>> with and help contribute to the Iceberg community. >> > >>>>>>> > >> > >>>>>>> > The first two such features we’d like to discuss are in >> support of >> > >>>>>>> efficient querying of dynamically typed, semi-structured data: >> variant data >> > >>>>>>> types, and subcolumnarization of variant columns. In more >> detail, for >> > >>>>>>> anyone who may not already be familiar: >> > >>>>>>> > >> > >>>>>>> > 1. Variant data types >> > >>>>>>> > Variant types allow for the efficient binary encoding of >> dynamic >> > >>>>>>> semi-structured data such as JSON, Avro, etc.
By encoding >> semi-structured >> > >>>>>>> data as a variant column, we retain the flexibility of the >> source data, >> > >>>>>>> while allowing query engines to more efficiently operate on the >> data. >> > >>>>>>> Snowflake has supported the variant data type on Snowflake >> tables for many >> > >>>>>>> years [1]. As more and more users utilize Iceberg tables in >> Snowflake, >> > >>>>>>> we’re hearing an increasing chorus of requests for variant >> support. >> > >>>>>>> Additionally, other query engines such as Apache Spark have >> begun adding >> > >>>>>>> variant support [2]. As such, we believe it would be beneficial >> to the >> > >>>>>>> Iceberg community as a whole to standardize on the variant data >> type >> > >>>>>>> encoding used across Iceberg tables. >> > >>>>>>> > >> > >>>>>>> > One specific point to make here is that, since an Apache OSS >> > >>>>>>> version of variant encoding already exists in Spark, it likely >> makes sense >> > >>>>>>> to simply adopt the Spark encoding as the Iceberg standard as >> well. The >> > >>>>>>> encoding we use internally today in Snowflake is slightly >> different, but >> > >>>>>>> essentially equivalent, and we see no particular value in >> trying to clutter >> > >>>>>>> the space with another equivalent-but-incompatible encoding. >> > >>>>>>> > >> > >>>>>>> > >> > >>>>>>> > 2. Subcolumnarization >> > >>>>>>> > Subcolumnarization of variant columns allows query engines to >> > >>>>>>> efficiently prune datasets when subcolumns (i.e., nested >> fields) within a >> > >>>>>>> variant column are queried, and also allows optionally >> materializing some >> > >>>>>>> of the nested fields as columns of their own, affording >> queries on these >> > >>>>>>> subcolumns the ability to read less data and spend less CPU on >> extraction. >> > >>>>>>> When subcolumnarizing, the system managing table metadata and >> data tracks >> > >>>>>>> individual pruning statistics (min, max, null, etc.)
for some >> subset of the >> > >>>>>>> nested fields within a variant, and also manages any optional >> > >>>>>>> materialization. Without subcolumnarization, any query which >> touches a >> > >>>>>>> variant column must read, parse, extract, and filter every row >> for which >> > >>>>>>> that column is non-null. Thus, by providing a standardized way >> of tracking >> > >>>>>>> subcolumn metadata and data for variant columns, Iceberg can make >> > >>>>>>> subcolumnar optimizations accessible across various catalogs >> and query >> > >>>>>>> engines. >> > >>>>>>> > >> > >>>>>>> > Subcolumnarization is a non-trivial topic, so we expect any >> > >>>>>>> concrete proposal to include not only the set of changes to >> Iceberg >> > >>>>>>> metadata that allow compatible query engines to interoperate on >> > >>>>>>> subcolumnarization data for variant columns, but also reference >> > >>>>>>> documentation explaining subcolumnarization principles and >> recommended best >> > >>>>>>> practices. >> > >>>>>>> > >> > >>>>>>> > >> > >>>>>>> > It sounds like the recent Geo proposal [3] may be a good >> starting >> > >>>>>>> point for how to approach this, so our plan is to write >> something up in >> > >>>>>>> that vein that covers the proposed spec changes, backwards >> compatibility, >> > >>>>>>> implementor burdens, etc. But we wanted to first reach out to >> the community >> > >>>>>>> to introduce ourselves and the idea, and see if there’s any >> early feedback >> > >>>>>>> we should incorporate before we spend too much time on a >> concrete proposal. >> > >>>>>>> > >> > >>>>>>> > Thank you!
>> > >>>>>>> > >> > >>>>>>> > [1] >> > >>>>>>> >> https://docs.snowflake.com/en/sql-reference/data-types-semistructured >> > >>>>>>> > [2] >> > >>>>>>> >> https://github.com/apache/spark/blob/master/common/variant/README.md >> > >>>>>>> > [3] >> > >>>>>>> >> https://docs.google.com/document/d/1iVFbrRNEzZl8tDcZC81GFt01QJkLJsI9E2NBOt21IRI/edit >> > >>>>>>> > >> > >>>>>>> > -Tyler, Nileema, Selcuk, Aihua >> > >>>>>>> > >> > >>>>>>> >> > >>>>>> >> > >> > -- >> > Ryan Blue >> > Databricks >> > >> > -- Ryan Blue Databricks
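As a postscript for readers new to the subcolumnarization idea discussed in the thread above: once per-subcolumn pruning stats (min, max, null counts) are tracked in table metadata, an engine can skip whole files for predicates on nested fields without decoding any variant bytes. A toy sketch of that check; the field names and stat layout here are hypothetical, not from the proposal:

```python
# Toy illustration of subcolumnarization-driven pruning. Assumes
# (hypothetically) that table metadata tracks min/max/null-count stats
# per nested field of a variant column; none of these names come from
# the actual proposal.

def can_skip_file(subcolumn_stats, field_path, lower_bound):
    """Return True if the stats prove that no row in the file can
    satisfy `variant_col:field_path > lower_bound`."""
    stats = subcolumn_stats.get(field_path)
    if stats is None:
        # No stats tracked for this nested field: must read the file and
        # extract the value from the variant bytes row by row.
        return False
    if stats["null_count"] == stats["row_count"]:
        return True  # the field is missing/null in every row of this file
    return stats["max"] <= lower_bound


# Per-file stats an engine might consult before opening the file.
file_stats = {
    "event.ts": {"min": 100, "max": 200, "null_count": 0, "row_count": 50},
}
```

With stats like these, a filter such as `variant_col:event.ts > 300` lets the engine skip the file outright, while `> 150` still requires reading it; fields without tracked stats always fall back to a full scan.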