Re: [DISCUSS] Variant Spec Location

Russell Spitzer Thu, 15 Aug 2024 11:30:57 -0700

I support that whole-heartedly. Parquet would be a great neutral location
for the spec.


On Thu, Aug 15, 2024 at 1:17 PM Ryan Blue <b...@databricks.com.invalid>
wrote:

> I think it's a good idea to reach out to the Spark community and make sure
> we are in agreement. Up until now I think we've been thinking more
> abstractly about what makes sense but before we make any decision we should
> definitely collaborate with the other communities.
>
> I'd also like to suggest an alternative for where this spec should be
> maintained that would hopefully allow us to avoid copying and maintaining
> multiple places. As we've already discussed, this is not an easy spec to
> find a home for because there are alternative projects that are all
> interested. Since this is a cross-engine type, Spark may not be ideal. At
> the same time, Delta already supports the variant spec so there's a similar
> problem maintaining this in Iceberg.
>
> I think that a reasonable and neutral option is to see if the Parquet
> community would be willing to host the spec and library. That fits with the
> spec because subcolumnarization is written assuming Parquet is the storage.
> It would also be the best place for broad compatibility because anyone
> using Parquet would have a strong motivation to standardize on the same
> encoding.
>
> Initially, I pushed for Iceberg instead of Parquet because we may want to
> have the same variant encoding in ORC, but what made me change my mind is
> that every layer (file format, table format, engine) has that problem and
> I've heard the concern about neutrality raised multiple times while
> discussing this question internally.
>
> I think the Parquet community is the most neutral option available. Would
> anyone else support asking the Spark and Parquet communities to maintain
> the variant spec in Parquet?
>
> Ryan
>
> On Thu, Aug 15, 2024 at 8:34 AM Xuanwo <xua...@apache.org> wrote:
>
>> From the iceberg-rust perspective, it could be extremely challenging to
>> keep track of both the Spark and Iceberg specifications. Having a single
>> source of truth would be much better. I believe this change will also
>> benefit Delta Lake if they implement the same approach. Perhaps we can try
>> contacting them to initiate such a project?
>>
>> On Thu, Aug 15, 2024, at 23:17, Gang Wu wrote:
>>
>> +1 on posting this discussion to dev@spark ML
>>
>> > I don't think there is anything that would stop us from moving to a
>> > joint project in the future
>>
>> My concern is that if we don't do this from day 1, we will never ever do
>> this.
>>
>> Best,
>> Gang
>>
>> On Thu, Aug 15, 2024 at 11:08 PM Russell Spitzer <
>> russell.spit...@gmail.com> wrote:
>>
>> That's fair, @Micah. So far all the discussions have been direct and off
>> the dev list. Would you like to make the request on the public Spark Dev
>> list? I would be glad to co-sign, I can also draft up a quick email if you
>> don't have time.
>>
>> On Thu, Aug 15, 2024 at 10:04 AM Micah Kornfield <emkornfi...@gmail.com>
>> wrote:
>>
>> > I agree that it would be beneficial to make a sub-project; the main
>> > problem is political, not logistical. I've been asking for movement from
>> > other relevant projects for a month and we simply haven't gotten anywhere.
>>
>>
>> I just wanted to double check that these issues were brought directly to
>> the Spark community (i.e. a discussion thread on the Spark developer
>> mailing list) and not via backchannels.
>>
>> I'm not sure the outcome would be different and I don't think this should
>> block forking the spec, but we should make sure that the decision is
>> publicly documented within both communities.
>>
>> Thanks,
>> Micah
>>
>> On Thu, Aug 15, 2024 at 7:47 AM Russell Spitzer <
>> russell.spit...@gmail.com> wrote:
>>
>> @Gang Wu
>> I agree that it would be beneficial to make a sub-project; the main
>> problem is political, not logistical. I've been asking for movement from
>> other relevant projects for a month and we simply haven't gotten anywhere.
>> I don't think there is anything that would stop us from moving to a joint
>> project in the future and if you know of some way of encouraging that
>> movement from other relevant parties I would be glad to collaborate in
>> doing that. One thing that I don't want to do is have the Iceberg project
>> stay in a holding pattern without any clear roadmap as to how to proceed.
>>
>> On Wed, Aug 14, 2024 at 11:12 PM Yufei Gu <flyrain...@gmail.com> wrote:
>>
>> I’m on board with copying the spec into our repository. However, as we’ve
>> talked about, it’s not just a straightforward copy—there are already some
>> divergences. Some of them are under discussion. Iceberg is definitely the
>> best place for these specs. Engines like Trino and Flink can then rely on
>> the Iceberg specs as a solid foundation.
>>
>> Yufei
>>
>> On Wed, Aug 14, 2024 at 7:51 PM Gang Wu <ust...@gmail.com> wrote:
>>
>> Sorry for chiming in late.
>>
>> From the discussion in
>> https://lists.apache.org/thread/xcyytoypgplfr74klg1z2rgjo6k5b0sq, I
>> don't quite understand why it is logistically complicated to create a
>> sub-project to hold the variant spec and impl.
>>
>> IMHO, copying the variant type spec into Apache Iceberg has some
>> deficiencies:
>> - It is a burden to update two repos if there is a variant type spec
>> change and will likely result in deviation if some changes do not reach
>> agreement from both parties.
>> - Implementers are required to keep an eye on both specs (considering
>> proprietary engines where both Iceberg and Delta are supported).
>> - Putting the spec and impl of the variant type in the Iceberg repo loses the
>> opportunity for better native support from file formats like Parquet and
>> ORC.
>>
>> I'm not sure if it is possible to create a separate project (e.g.
>> apache/variant-type) to make it a single point of truth. We can learn from
>> the experience of Apache Arrow. In this fashion, different engines, table
>> formats and file formats can follow the same spec and are free to depend on
>> the reference implementations from apache/variant-type or implement their
>> own.
>>
>> Best,
>> Gang
>>
>>
>>
>>
>> On Thu, Aug 15, 2024 at 10:07 AM Jack Ye <yezhao...@gmail.com> wrote:
>>
>> +1 for copying the spec into our repository, I think we need to own it
>> fully as a part of the table spec, and we can build compatibility through
>> tests.
>>
>> -Jack
>>
>> On Wed, Aug 14, 2024 at 12:52 PM Russell Spitzer <
>> russell.spit...@gmail.com> wrote:
>>
>> I'm not really in favor of linking and annotating as that just makes
>> things more complicated and still is essentially forking just with more
>> steps. If we just track our annotations / modifications to a single
>> commit/version then we have the same issue again but now you have to go to
>> multiple sources to get the actual Spec. *In addition, our very copy of
>> the Spec is going to require new types which don't exist in the Spark Spec,
>> which necessarily means diverging.* We will need to take up new
>> primitive IDs (as noted in my first email).
>>
>> The other issue I have is I don't think the Spark Spec is really going
>> through a thorough review process from all members of the Spark community.
>> I believe it probably should have gone through the SPIP process but instead
>> seems to have been merged without broad community involvement.
>>
>> The only way to truly avoid diverging is to have only a single copy of
>> the spec. In our previous discussions, the vast majority of the Apache
>> Iceberg community wanted it to exist here.
>>
>> On Wed, Aug 14, 2024 at 2:19 PM Daniel Weeks <dwe...@apache.org> wrote:
>>
>> I'm really excited about the introduction of variant type to Iceberg, but
>> I want to raise concerns about forking the spec.
>>
>> I feel like preemptively forking would create the situation where we end
>> up diverging because there's little reason to work with both communities to
>> evolve in a way that benefits everyone.
>>
>> I would much rather point to a specific version of the spec and annotate
>> any variance in Iceberg's handling. This would allow us to continue
>> without dividing the communities.
>>
>> If at any point there are irreconcilable differences, I would support
>> forking, but I don't feel like that should be the initial step.
>>
>> No one is excited about the possibility that the physical representations
>> end up diverging, but it feels like we're setting ourselves up for that
>> exact scenario.
>>
>> -Dan
>>
>>
>> On Wed, Aug 14, 2024 at 6:54 AM Fokko Driesprong <fo...@apache.org>
>> wrote:
>>
>> +1 to what's already being said here. It is good to copy the spec to
>> Iceberg and add context that's specific to Iceberg, but at the same time,
>> we should maintain compatibility.
>>
>> Kind regards,
>> Fokko
>>
>> On Wed, Aug 14, 2024 at 15:30 Manu Zhang <owenzhang1...@gmail.com> wrote:
>>
>> +1 to copy the spec into our repository. I think the best way to keep
>> compatibility is building integration tests.
>>
>> Thanks,
>> Manu
>>
>> On Wed, Aug 14, 2024 at 8:27 PM Péter Váry <peter.vary.apa...@gmail.com>
>> wrote:
>>
>> Thanks Russell and Aihua for pushing Variant support!
>>
>> Given the differences between the supported types and the lack of
>> interest from the other project, I think it is reasonable to duplicate the
>> specification to our repository.
>> I would give very strong emphasis on sticking to the Spark spec as much
>> as possible, to keep compatibility as much as possible. Maybe even revert
>> to a shared specification if the situation changes.
>>
>> Thanks,
>> Peter
>>
>> On Tue, Aug 13, 2024 at 19:52 Aihua Xu <aihu...@gmail.com> wrote:
>>
>> Thanks Russell for bringing this up.
>>
>> This is the main blocker to moving forward with Variant support in
>> Iceberg, and hopefully we can reach a consensus. I also feel it makes
>> more sense to move the spec into Iceberg than to have the Spark engine own
>> it while we try to keep ours compatible with the Spark spec.
>>
>> Thanks,
>> Aihua
>>
>> On Mon, Aug 12, 2024 at 6:50 PM Russell Spitzer <
>> russell.spit...@gmail.com> wrote:
>>
>> Hi Y’all,
>>
>> We’ve hit a bit of a roadblock with the Variant Proposal. While we were
>> hoping to move the Variant and Shredding specifications from Spark into
>> Iceberg, there doesn’t seem to be a lot of interest in that. Unfortunately,
>> I think we have a number of issues with just linking to the Spark project
>> directly from within Iceberg and *I believe we need to copy the
>> specifications into our repository*.
>>
>> There are a few reasons why I think this is necessary:
>>
>> First, we have a divergence of types already. The Spark Specification
>> already includes types which Iceberg has no definition for (19, 20
>> <https://github.com/apache/spark/blob/master/common/variant/README.md#encoding-types>
>> - Interval Types) and Iceberg already has a type which is not included
>> within the Spark Specification (Time) and will soon have more with
>> TimestampNS, and Geo.
>>
>> Second, we would like to make sure that Spark is not a hard dependency
>> for other engines. We are working with several implementers of the Iceberg
>> spec and it has previously been agreed that it would be best if the source
>> of truth for Variant existed in an engine and file format neutral location.
>> The Iceberg project has a good open model of governance and, as we have
>> seen so far discussing Variant
>> <https://lists.apache.org/thread/xcyytoypgplfr74klg1z2rgjo6k5b0sq>, open
>> and active collaboration. This would also help as we can strictly version
>> our changes in-line with the rest of the Iceberg spec.
>>
>> Third, the Shredding spec is not quite finished and requires some group
>> analysis and discussion before we commit it. I think again the Iceberg
>> community is probably the right place for this to happen as we have already
>> started discussions here on these topics.
>>
>> For these reasons I think we should go with a direct copy of the existing
>> specification from the Spark Project and move ahead with our discussions
>> and modifications within Iceberg. That said, *I do not want to diverge
>> if possible from the Spark proposal*. For example, although we do not
>> use the Interval types above, I think we should *not* reuse those type
>> ids within our spec. Iceberg's Variant Spec types 19 and 20 would remain
>> unused along with any other types we think are not applicable. We should
>> strive whenever possible to allow for compatibility.
>>
>> In the interest of moving forward with this proposal I am hoping to see
>> if anyone in the community objects to this plan going forward or has a
>> better alternative.
>>
>> As always I am thankful for your time and am eager to hear back from
>> everyone,
>> Russ
>>
>> Xuanwo
>>
>> https://xuanwo.io/
>>
>>
>
> --
> Ryan Blue
> Databricks
>
