Thanks for the discussion and feedback. Do we have consensus on points 1 and 3 to move forward with the Spark variant encoding and support the Variant type only? Otherwise, let me know how to proceed from here.
Regarding point 2, I also feel Iceberg is the more natural home for such a subproject for the variant spec and implementation. But let me reach out to the Spark community to discuss.

Thanks,
Aihua

On Fri, Jul 19, 2024 at 9:35 AM Yufei Gu <flyrain...@gmail.com> wrote:

> Agreed with point 1.
>
> For point 2, I also prefer to hold the spec and reference implementation under Iceberg. Here are the reasons:
> 1. It is unconventional and impractical for one engine to depend on another for data types. For instance, it is not ideal for Trino to rely on data types defined by the Spark engine.
> 2. Iceberg serves as a bridge between engines and file formats. By centralizing the specification in Iceberg, any future optimizations or updates to file formats can be referred to within Iceberg, ensuring consistency and reducing dependencies.
>
> For point 3, I'd prefer to support the variant type only at this moment.
>
> Yufei
>
> On Thu, Jul 18, 2024 at 12:55 PM Ryan Blue <b...@databricks.com.invalid> wrote:
>
>> Similarly, I'm aligned with point 1 and I'd choose to support only variant for point 3.
>>
>> We'll need to work with the Spark community to find a good place for the library and spec, since it touches many different projects. I'd also prefer Iceberg as the home.
>>
>> I also think it's a good idea to get subcolumnarization into our spec when we update. Without that I think the feature will be fairly limited.
>>
>> On Thu, Jul 18, 2024 at 10:56 AM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>
>>> I'm aligned with point 1.
>>>
>>> For point 2, I think we should choose quickly. I honestly do think this would be fine as part of the Iceberg spec directly, but I understand it may be better for the broader community if it was a subproject. As a subproject, I would still prefer it being an Iceberg subproject since we are engine/file-format agnostic.
>>>
>>> 3. I support adding just Variant.
>>>
>>> On Thu, Jul 18, 2024 at 12:54 AM Aihua Xu <aihu...@apache.org> wrote:
>>>
>>>> Hello community,
>>>>
>>>> It's great to sync up with some of you on Variant and Subcolumnarization support in Iceberg again. Apologies that I didn't record the meeting, but here are some key items that we want to follow up on with the community.
>>>>
>>>> 1. Adopt Spark Variant encoding
>>>> Those present were in favor of adopting the Spark variant encoding for Iceberg Variant, with extensions to support other Iceberg types. We would like to know if anyone has an objection to reusing this open source encoding.
>>>>
>>>> 2. Movement of the Spark Variant spec to another project
>>>> To avoid introducing Apache Spark as a dependency for the engines and file formats, we discussed separating the Spark Variant encoding spec and implementation from the Spark project to a neutral location. We thought up several solutions but didn't have consensus on any of them. We are looking for more feedback on this topic from the community, either in terms of support for one of these options or another idea on how to support the spec.
>>>>
>>>> Options proposed:
>>>> * Leave the spec in Spark (difficult for versioning and for other engines)
>>>> * Copy the spec into the Iceberg project directly (difficult for other table formats)
>>>> * Create a sub-project of Apache Iceberg and move the spec and reference implementation there (logistically complicated)
>>>> * Create a sub-project of Apache Spark and move the spec and reference implementation there (logistically complicated)
>>>>
>>>> 3. Add Variant type vs. Variant and JSON types
>>>> Those who were present were in favor of adding only the Variant type to Iceberg. We are looking for anyone who has an objection to going forward with just the Variant type and no Iceberg JSON type. We favored adding the Variant type only because:
>>>> * Introducing a JSON type would require engines that only support VARIANT to do write-time validation of their input to a JSON column; an engine without a JSON type couldn't support this.
>>>> * Engines which don't support Variant will work most of the time, and the spec can define fallback strings for reading unsupported types. Writing JSON into a Variant will always work.
>>>>
>>>> 4. Support for a Subcolumnarization spec (shredding in Spark)
>>>> We have no action items on this but would like to follow up on discussions of Subcolumnarization in the future.
>>>> * We had general agreement that this should be included in Iceberg V3, or else adding Variant may not be useful.
>>>> * We are also interested in adopting the shredding spec from Spark and would like to move it to whatever place we decide the Variant spec is going to live.
>>>>
>>>> Let us know if we missed anything or if you have any additional thoughts or suggestions.
>>>>
>>>> Thanks
>>>> Aihua
>>>>
>>>> On 2024/07/15 18:32:22 Aihua Xu wrote:
>>>> > Thanks for the discussion.
>>>> >
>>>> > I will move forward to work on the spec PR.
>>>> >
>>>> > Regarding the implementation, we will have a module for Variant support in Iceberg, so we will not have to bring in Spark libraries.
>>>> >
>>>> > I'm reposting the meeting invite in case it wasn't clear in my original email, since I included it at the end. Looks like we don't have major objections/divergences, but let's sync up and reach consensus.
>>>> >
>>>> > Meeting invite:
>>>> >
>>>> > Wednesday, July 17 · 9:00 – 10:00am
>>>> > Time zone: America/Los_Angeles
>>>> > Google Meet joining info
>>>> > Video call link: https://meet.google.com/pbm-ovzn-aoq
>>>> > Or dial: (US) +1 650-449-9343 PIN: 170 576 525#
>>>> > More phone numbers: https://tel.meet/pbm-ovzn-aoq?pin=4079632691790
>>>> >
>>>> > Thanks,
>>>> > Aihua
>>>> >
>>>> > On 2024/07/12 20:55:01 Micah Kornfield wrote:
>>>> > > I don't think this needs to hold up the PR, but I think coming to a consensus on the exact set of supported types is worthwhile (and whether the goal is to maintain the same set as specified by the Spark Variant type, or whether divergence is expected/allowed). From a fragmentation perspective it would be a shame if they diverge, so maybe a next step is also suggesting support to the Spark community for the missing existing Iceberg types?
>>>> > >
>>>> > > Thanks,
>>>> > > Micah
>>>> > >
>>>> > > On Fri, Jul 12, 2024 at 1:44 PM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>>> > >
>>>> > > > Just talked with Aihua and he's working on the Spec PR now.
>>>> > > > We can get feedback there from everyone.
>>>> > > >
>>>> > > > On Fri, Jul 12, 2024 at 3:41 PM Ryan Blue <b...@databricks.com.invalid> wrote:
>>>> > > >
>>>> > > >> Good idea, but I'm hoping that we can continue to get their feedback in parallel to getting the spec changes started. Piotr didn't seem to object to the encoding from what I read of his comments. Hopefully he (and others) chime in here.
>>>> > > >>
>>>> > > >> On Fri, Jul 12, 2024 at 1:32 PM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>>> > > >>
>>>> > > >>> I just want to make sure we get Piotr and Peter on board as representatives of the Flink and Trino engines. Also, make sure we have anyone else chime in who has experience with Ray, if possible.
>>>> > > >>>
>>>> > > >>> Spec changes feel like the right next step.
>>>> > > >>>
>>>> > > >>> On Fri, Jul 12, 2024 at 3:14 PM Ryan Blue <b...@databricks.com.invalid> wrote:
>>>> > > >>>
>>>> > > >>>> Okay, what are the next steps here? This proposal has been out for quite a while and I don't see any major objections to using the Spark encoding. It's quite well designed and fits the need well. It can also be extended to support additional types that are missing if that's a priority.
>>>> > > >>>>
>>>> > > >>>> Should we move forward by starting a draft of the changes to the table spec? Then we can vote on committing those changes and get moving on an implementation (or possibly do the implementation in parallel).
>>>> > > >>>>
>>>> > > >>>> On Fri, Jul 12, 2024 at 1:08 PM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>>> > > >>>>
>>>> > > >>>>> That's fair, I'm sold on an Iceberg module.
>>>> > > >>>>>
>>>> > > >>>>> On Fri, Jul 12, 2024 at 2:53 PM Ryan Blue <b...@databricks.com.invalid> wrote:
>>>> > > >>>>>
>>>> > > >>>>>> > Feels like eventually the encoding should land in parquet proper right?
>>>> > > >>>>>>
>>>> > > >>>>>> What about using it in ORC? I don't know where it should end up. Maybe Iceberg should make a standalone module from it?
>>>> > > >>>>>>
>>>> > > >>>>>> On Fri, Jul 12, 2024 at 12:38 PM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>>> > > >>>>>>
>>>> > > >>>>>>> Feels like eventually the encoding should land in parquet proper right? I'm fine with us just copying it into Iceberg for the time being, though.
>>>> > > >>>>>>>
>>>> > > >>>>>>> On Fri, Jul 12, 2024 at 2:31 PM Ryan Blue <b...@databricks.com.invalid> wrote:
>>>> > > >>>>>>>
>>>> > > >>>>>>>> Oops, it looks like I missed where Aihua brought this up in his last email:
>>>> > > >>>>>>>>
>>>> > > >>>>>>>> > do we have an issue to directly use Spark implementation in Iceberg?
>>>> > > >>>>>>>>
>>>> > > >>>>>>>> Yes, I think that we do have an issue using the Spark library. What do you think about a Java implementation in Iceberg?
>>>> > > >>>>>>>>
>>>> > > >>>>>>>> Ryan
>>>> > > >>>>>>>>
>>>> > > >>>>>>>> On Fri, Jul 12, 2024 at 12:28 PM Ryan Blue <b...@databricks.com> wrote:
>>>> > > >>>>>>>>
>>>> > > >>>>>>>>> I raised the same point from Peter's email in a comment on the doc as well. There is a spark-variant_2.13 artifact that would be a much smaller scope than relying on large portions of Spark, but even then I doubt that it is a good idea for Iceberg to depend on it, because it is a Scala artifact and we would need to bring in a ton of Scala libs. I think what makes the most sense is to have an independent implementation of the spec in Iceberg.
>>>> > > >>>>>>>>>
>>>> > > >>>>>>>>> On Fri, Jul 12, 2024 at 11:51 AM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>>> > > >>>>>>>>>
>>>> > > >>>>>>>>>> Hi Aihua,
>>>> > > >>>>>>>>>> Long time no see :)
>>>> > > >>>>>>>>>> Would this mean that every engine which plans to support the Variant data type needs to add Spark as a dependency? Like Flink/Trino/Hive, etc.?
>>>> > > >>>>>>>>>> Thanks, Peter
>>>> > > >>>>>>>>>>
>>>> > > >>>>>>>>>> On Fri, Jul 12, 2024, 19:10 Aihua Xu <aihu...@apache.org> wrote:
>>>> > > >>>>>>>>>>
>>>> > > >>>>>>>>>>> Thanks Ryan.
>>>> > > >>>>>>>>>>>
>>>> > > >>>>>>>>>>> Yeah. That's another reason we want to pursue the Spark encoding: to keep compatibility for the open source engines.
>>>> > > >>>>>>>>>>>
>>>> > > >>>>>>>>>>> One more question regarding the encoding implementation: do we have an issue to directly use Spark implementation in Iceberg? Russell pointed out that Trino doesn't have a Spark dependency and that could be a problem?
>>>> > > >>>>>>>>>>>
>>>> > > >>>>>>>>>>> Thanks,
>>>> > > >>>>>>>>>>> Aihua
>>>> > > >>>>>>>>>>>
>>>> > > >>>>>>>>>>> On 2024/07/12 15:02:06 Ryan Blue wrote:
>>>> > > >>>>>>>>>>> > Thanks, Aihua!
>>>> > > >>>>>>>>>>> >
>>>> > > >>>>>>>>>>> > I think that the encoding choice in the current doc is a good one. I went through the Spark encoding in detail and it looks like a better choice than the other candidate encodings for quickly accessing nested fields.
>>>> > > >>>>>>>>>>> >
>>>> > > >>>>>>>>>>> > Another reason to use the Spark type is that this is what Delta's variant type is based on, so Parquet files in tables written by Delta could be converted or used in Iceberg tables without needing to rewrite variant data. (Also, note that I work at Databricks and have an interest in increasing format compatibility.)
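
For context on the "quickly accessing nested fields" point above: in the Spark encoding [2], every variant value starts with a one-byte header whose low two bits select the basic type (primitive, short string, object, or array), and object fields are stored with sorted field IDs so a reader can binary-search for a key instead of scanning the whole value. A minimal Java sketch of the header decode follows; the class and method names are illustrative, not from any existing library, and the layout should be checked against the Spark spec before relying on it.

    // Sketch: inspecting a Spark-encoded variant value header [2].
    // Illustrative names; not an existing library API.
    import java.nio.charset.StandardCharsets;

    public final class VariantValueHeader {
      // Basic types encoded in the low 2 bits of the first value byte.
      static final int PRIMITIVE = 0;    // null, booleans, ints, double, decimal, ...
      static final int SHORT_STRING = 1; // inline string of up to 63 bytes
      static final int OBJECT = 2;       // sorted field ids enable binary search
      static final int ARRAY = 3;

      static int basicType(byte[] value) {
        return value[0] & 0x03;
      }

      static int typeInfo(byte[] value) {
        return (value[0] >> 2) & 0x3F;   // meaning depends on the basic type
      }

      static String readShortString(byte[] value) {
        int len = typeInfo(value);       // for short strings, type info is the length
        return new String(value, 1, len, StandardCharsets.UTF_8);
      }
    }

Because an object's field IDs are kept sorted by the key strings they reference in the metadata dictionary, finding a nested field is a binary search plus one offset jump rather than a parse of the whole document, which is the access-speed property being described here.
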
>>>> > > >>>>>>>>>>> >
>>>> > > >>>>>>>>>>> > Ryan
>>>> > > >>>>>>>>>>> >
>>>> > > >>>>>>>>>>> > On Thu, Jul 11, 2024 at 11:21 AM Aihua Xu <aihua...@snowflake.com.invalid> wrote:
>>>> > > >>>>>>>>>>> >
>>>> > > >>>>>>>>>>> > > [Discuss] Consensus for Variant Encoding
>>>> > > >>>>>>>>>>> > >
>>>> > > >>>>>>>>>>> > > It was great to be able to present the Variant type proposal in the community sync yesterday, and I'm looking to host a meeting next week (targeting 9am on July 17th) to go over any further concerns about the encoding of the Variant type and any other questions on the first phase of the proposal <https://docs.google.com/document/d/1QjhpG_SVNPZh3anFcpicMQx90ebwjL7rmzFYfUP89Iw/edit>. We are hoping that anyone who is interested in the proposal can either join or reply with their comments so we can discuss them. A summary of the discussion and notes will be sent to the mailing list for further comment there.
>>>> > > >>>>>>>>>>> > >
>>>> > > >>>>>>>>>>> > > - What should be the underlying binary representation?
>>>> > > >>>>>>>>>>> > >
>>>> > > >>>>>>>>>>> > > We have evaluated a few encodings in the doc, including ION, JSONB, and the Spark encoding. Choosing the underlying encoding is an important first step here, and we believe we have general support for Spark's Variant encoding. We would like to hear if anyone else has strong opinions in this space.
>>>> > > >>>>>>>>>>> > >
>>>> > > >>>>>>>>>>> > > - Should we support multiple logical types or just Variant? Variant vs. Variant + JSON.
>>>> > > >>>>>>>>>>> > >
>>>> > > >>>>>>>>>>> > > This is to discuss what logical data type(s) should be supported in Iceberg - Variant only vs. Variant + JSON. Both types would share the same underlying encoding but would imply different limitations on engines working with those types.
>>>> > > >>>>>>>>>>> > >
>>>> > > >>>>>>>>>>> > > From the sync-up meeting, we lean toward supporting Variant only, and we want to reach consensus on the supported type(s).
>>>> > > >>>>>>>>>>> > >
>>>> > > >>>>>>>>>>> > > - How should we move forward with Subcolumnarization?
>>>> > > >>>>>>>>>>> > >
>>>> > > >>>>>>>>>>> > > Subcolumnarization is an optimization for the Variant type that separates out subcolumns with their own metadata. This is not critical for choosing the initial encoding of the Variant type, so we were hoping to gain consensus on leaving that for a follow-up spec.
>>>> > > >>>>>>>>>>> > >
>>>> > > >>>>>>>>>>> > > Thanks
>>>> > > >>>>>>>>>>> > > Aihua
>>>> > > >>>>>>>>>>> > >
>>>> > > >>>>>>>>>>> > > Meeting invite:
>>>> > > >>>>>>>>>>> > > Wednesday, July 17 · 9:00 – 10:00am
>>>> > > >>>>>>>>>>> > > Time zone: America/Los_Angeles
>>>> > > >>>>>>>>>>> > > Google Meet joining info
>>>> > > >>>>>>>>>>> > > Video call link: https://meet.google.com/pbm-ovzn-aoq
>>>> > > >>>>>>>>>>> > > Or dial: (US) +1 650-449-9343 PIN: 170 576 525#
>>>> > > >>>>>>>>>>> > > More phone numbers: https://tel.meet/pbm-ovzn-aoq?pin=4079632691790
>>>> > > >>>>>>>>>>> > >
>>>> > > >>>>>>>>>>> > > On Tue, May 28, 2024 at 9:21 PM Aihua Xu <aihua...@snowflake.com> wrote:
>>>> > > >>>>>>>>>>> > >
>>>> > > >>>>>>>>>>> > >> Hello,
>>>> > > >>>>>>>>>>> > >>
>>>> > > >>>>>>>>>>> > >> We have drafted the proposal <https://docs.google.com/document/d/1QjhpG_SVNPZh3anFcpicMQx90ebwjL7rmzFYfUP89Iw/edit> for the Variant data type. Please help review and comment.
>>>> > > >>>>>>>>>>> > >>
>>>> > > >>>>>>>>>>> > >> Thanks,
>>>> > > >>>>>>>>>>> > >> Aihua
>>>> > > >>>>>>>>>>> > >>
>>>> > > >>>>>>>>>>> > >> On Thu, May 16, 2024 at 12:45 PM Jack Ye <yezhao...@gmail.com> wrote:
>>>> > > >>>>>>>>>>> > >>
>>>> > > >>>>>>>>>>> > >>> +10000 for a JSON/BSON type. We also had the same discussion internally, and a JSON type would play really well with, for example, the SUPER type in Redshift: https://docs.aws.amazon.com/redshift/latest/dg/r_SUPER_type.html, and can also provide better integration with the Trino JSON type.
>>>> > > >>>>>>>>>>> > >>>
>>>> > > >>>>>>>>>>> > >>> Looking forward to the proposal!
>>>> > > >>>>>>>>>>> > >>>
>>>> > > >>>>>>>>>>> > >>> Best,
>>>> > > >>>>>>>>>>> > >>> Jack Ye
>>>> > > >>>>>>>>>>> > >>>
>>>> > > >>>>>>>>>>> > >>> On Wed, May 15, 2024 at 9:37 AM Tyler Akidau <tyler.aki...@snowflake.com.invalid> wrote:
>>>> > > >>>>>>>>>>> > >>>
>>>> > > >>>>>>>>>>> > >>>> On Tue, May 14, 2024 at 7:58 PM Gang Wu <ust...@gmail.com> wrote:
>>>> > > >>>>>>>>>>> > >>>>
>>>> > > >>>>>>>>>>> > >>>>> > We may need some guidance on just how many we need to look at; we were planning on Spark and Trino, but weren't sure how much further down the rabbit hole we needed to go.
>>>> > > >>>>>>>>>>> > >>>>>
>>>> > > >>>>>>>>>>> > >>>>> There are some engines living outside the Java world. It would be good if the proposal could cover the effort it takes to integrate the variant type into them (e.g. velox, datafusion, etc.). This is something that some proprietary Iceberg vendors also care about.
>>>> > > >>>>>>>>>>> > >>>>
>>>> > > >>>>>>>>>>> > >>>> Ack, makes sense. We can make sure to share some perspective on this.
>>>> > > >>>>>>>>>>> > >>>>
>>>> > > >>>>>>>>>>> > >>>>> > Not necessarily, no. As long as there's a binary type and Iceberg and the query engines are aware that the binary column needs to be interpreted as a variant, that should be sufficient.
>>>> > > >>>>>>>>>>> > >>>>>
>>>> > > >>>>>>>>>>> > >>>>> From the perspective of interoperability, it would be good to support a native type in the file specs. Life will be easier for projects like Apache XTable. The file format could also provide finer-grained statistics for the variant type, which facilitates data skipping.
>>>> > > >>>>>>>>>>> > >>>>
>>>> > > >>>>>>>>>>> > >>>> Agreed, there can definitely be additional value in native file format integration. Just wanted to highlight that it's not a strict requirement.
>>>> > > >>>>>>>>>>> > >>>>
>>>> > > >>>>>>>>>>> > >>>> -Tyler
>>>> > > >>>>>>>>>>> > >>>>
>>>> > > >>>>>>>>>>> > >>>>> Gang
>>>> > > >>>>>>>>>>> > >>>>>
>>>> > > >>>>>>>>>>> > >>>>> On Wed, May 15, 2024 at 6:49 AM Tyler Akidau <tyler.aki...@snowflake.com.invalid> wrote:
>>>> > > >>>>>>>>>>> > >>>>>
>>>> > > >>>>>>>>>>> > >>>>>> Good to see you again as well, JB! Thanks!
>>>> > > >>>>>>>>>>> > >>>>>>
>>>> > > >>>>>>>>>>> > >>>>>> -Tyler
>>>> > > >>>>>>>>>>> > >>>>>>
>>>> > > >>>>>>>>>>> > >>>>>> On Tue, May 14, 2024 at 1:04 PM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>>>> > > >>>>>>>>>>> > >>>>>>
>>>> > > >>>>>>>>>>> > >>>>>>> Hi Tyler,
>>>> > > >>>>>>>>>>> > >>>>>>>
>>>> > > >>>>>>>>>>> > >>>>>>> Super happy to see you there :) It reminds me of our discussions back at the start of Apache Beam :)
>>>> > > >>>>>>>>>>> > >>>>>>>
>>>> > > >>>>>>>>>>> > >>>>>>> Anyway, the thread is pretty interesting. I remember some discussions about a JSON data type for spec v3. The binary data type is already supported in spec v2.
>>>> > > >>>>>>>>>>> > >>>>>>>
>>>> > > >>>>>>>>>>> > >>>>>>> I'm looking forward to the proposal and happy to help on this!
>>>> > > >>>>>>>>>>> > >>>>>>>
>>>> > > >>>>>>>>>>> > >>>>>>> Regards
>>>> > > >>>>>>>>>>> > >>>>>>> JB
>>>> > > >>>>>>>>>>> > >>>>>>>
>>>> > > >>>>>>>>>>> > >>>>>>> On Sat, May 11, 2024 at 7:06 AM Tyler Akidau <tyler.aki...@snowflake.com.invalid> wrote:
>>>> > > >>>>>>>>>>> > >>>>>>> >
>>>> > > >>>>>>>>>>> > >>>>>>> > Hello,
>>>> > > >>>>>>>>>>> > >>>>>>> >
>>>> > > >>>>>>>>>>> > >>>>>>> > We (Tyler, Nileema, Selcuk, Aihua) are working on a proposal for which we'd like to get early feedback from the community. As you may know, Snowflake has embraced Iceberg as its open Data Lake format. Having made good progress on our own adoption of the Iceberg standard, we're now in a position where there are features not yet supported in Iceberg which we think would be valuable for our users, and that we would like to discuss with and help contribute to the Iceberg community.
>>>> > > >>>>>>>>>>> > >>>>>>> >
>>>> > > >>>>>>>>>>> > >>>>>>> > The first two such features we'd like to discuss are in support of efficient querying of dynamically typed, semi-structured data: variant data types, and subcolumnarization of variant columns. In more detail, for anyone who may not already be familiar:
>>>> > > >>>>>>>>>>> > >>>>>>> >
>>>> > > >>>>>>>>>>> > >>>>>>> > 1. Variant data types
>>>> > > >>>>>>>>>>> > >>>>>>> > Variant types allow for the efficient binary encoding of dynamic semi-structured data such as JSON, Avro, etc. By encoding semi-structured data as a variant column, we retain the flexibility of the source data while allowing query engines to operate on the data more efficiently. Snowflake has supported the variant data type on Snowflake tables for many years [1]. As more and more users utilize Iceberg tables in Snowflake, we're hearing an increasing chorus of requests for variant support. Additionally, other query engines such as Apache Spark have begun adding variant support [2]. As such, we believe it would be beneficial to the Iceberg community as a whole to standardize on the variant data type encoding used across Iceberg tables.
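
To make "efficient binary encoding" concrete: in the Spark encoding [2], a variant is carried as a pair of binary values, metadata (a dictionary of object keys) plus value, and every value announces its type in its first byte. Below is a rough, hand-rolled Java sketch of two trivial values; a real writer would use a builder, and the byte layouts reflect my reading of the Spark spec, so verify against [2] before relying on them.

    // Sketch: hand-encoding two tiny variant values per the Spark spec [2].
    // For intuition only; real engines use a builder instead of raw bytes.
    import java.nio.charset.StandardCharsets;

    public final class VariantEncodeSketch {
      // A JSON string like "iceberg" as a "short string" value: the header
      // byte packs basic type 1 (short string) with the byte length above it.
      static byte[] shortString(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8); // must be under 64 bytes
        byte[] out = new byte[1 + utf8.length];
        out[0] = (byte) ((utf8.length << 2) | 1);
        System.arraycopy(utf8, 0, out, 1, utf8.length);
        return out;
      }

      // A small JSON number as a one-byte integer: basic type 0 (primitive)
      // with the int8 primitive type id in the upper header bits, then the value.
      static byte[] int8(byte v) {
        final int INT8_TYPE_ID = 3; // per the Spark primitive type table
        return new byte[] { (byte) (INT8_TYPE_ID << 2), v };
      }
    }

The payoff is that a reader learns the runtime type from a single header byte instead of re-parsing JSON text, while arbitrary nesting remains representable.
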
>>>> > > >>>>>>>>>>> > >>>>>>> >
>>>> > > >>>>>>>>>>> > >>>>>>> > One specific point to make here is that, since an Apache OSS version of variant encoding already exists in Spark, it likely makes sense to simply adopt the Spark encoding as the Iceberg standard as well. The encoding we use internally today in Snowflake is slightly different, but essentially equivalent, and we see no particular value in trying to clutter the space with another equivalent-but-incompatible encoding.
>>>> > > >>>>>>>>>>> > >>>>>>> >
>>>> > > >>>>>>>>>>> > >>>>>>> > 2. Subcolumnarization
>>>> > > >>>>>>>>>>> > >>>>>>> > Subcolumnarization of variant columns allows query engines to efficiently prune datasets when subcolumns (i.e., nested fields) within a variant column are queried, and it also allows optionally materializing some of the nested fields as columns of their own, affording queries on these subcolumns the ability to read less data and spend less CPU on extraction. When subcolumnarizing, the system managing table metadata and data tracks individual pruning statistics (min, max, null, etc.) for some subset of the nested fields within a variant, and also manages any optional materialization. Without subcolumnarization, any query which touches a variant column must read, parse, extract, and filter every row for which that column is non-null. Thus, by providing a standardized way of tracking subcolumn metadata and data for variant columns, Iceberg can make subcolumnar optimizations accessible across various catalogs and query engines.
>>>> > > >>>>>>>>>>> > >>>>>>> >
>>>> > > >>>>>>>>>>> > >>>>>>> > Subcolumnarization is a non-trivial topic, so we expect any concrete proposal to include not only the set of changes to Iceberg metadata that allow compatible query engines to interoperate on subcolumnarization data for variant columns, but also reference documentation explaining subcolumnarization principles and recommended best practices.
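
A rough Java sketch of the pruning half of this idea: if each data file carries min/max/null statistics for selected nested paths inside a variant column, a planner can skip files whose ranges cannot satisfy a predicate. The stats shape below is invented for illustration; defining the real metadata layout is exactly what the proposal would need to cover.

    // Sketch: per-file stats for nested paths inside a variant column.
    // The ColumnStats shape is invented for illustration only.
    import java.util.Map;

    record ColumnStats(long min, long max, long nullCount) {}

    final class SubcolumnPruner {
      // e.g. subcolumnPath "user.age", with stats collected at write time
      static boolean fileMightMatch(Map<String, ColumnStats> fileStats,
                                    String subcolumnPath, long eqValue) {
        ColumnStats stats = fileStats.get(subcolumnPath);
        if (stats == null) {
          return true; // no stats tracked for this path: the file must be read
        }
        // An equality predicate can only match if the value falls in [min, max].
        return eqValue >= stats.min() && eqValue <= stats.max();
      }
    }

Without such stats, a filter on a nested field forces the read-parse-extract-filter pass over every non-null variant row described above; with them, whole files drop out of the scan.
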
>>>> > > >>>>>>>>>>> > >>>>>>> >
>>>> > > >>>>>>>>>>> > >>>>>>> > It sounds like the recent Geo proposal [3] may be a good starting point for how to approach this, so our plan is to write something up in that vein that covers the proposed spec changes, backwards compatibility, implementor burdens, etc. But we wanted to first reach out to the community to introduce ourselves and the idea, and see if there's any early feedback we should incorporate before we spend too much time on a concrete proposal.
>>>> > > >>>>>>>>>>> > >>>>>>> >
>>>> > > >>>>>>>>>>> > >>>>>>> > Thank you!
>>>> > > >>>>>>>>>>> > >>>>>>> >
>>>> > > >>>>>>>>>>> > >>>>>>> > [1] https://docs.snowflake.com/en/sql-reference/data-types-semistructured
>>>> > > >>>>>>>>>>> > >>>>>>> > [2] https://github.com/apache/spark/blob/master/common/variant/README.md
>>>> > > >>>>>>>>>>> > >>>>>>> > [3] https://docs.google.com/document/d/1iVFbrRNEzZl8tDcZC81GFt01QJkLJsI9E2NBOt21IRI/edit
>>>> > > >>>>>>>>>>> > >>>>>>> >
>>>> > > >>>>>>>>>>> > >>>>>>> > -Tyler, Nileema, Selcuk, Aihua
>>>> > > >>>>>>>>>>> >
>>>> > > >>>>>>>>>>> > --
>>>> > > >>>>>>>>>>> > Ryan Blue
>>>> > > >>>>>>>>>>> > Databricks
>>
>> --
>> Ryan Blue
>> Databricks