That's fair @Micah; so far all the discussions have been direct and off the dev list. Would you like to make the request on the public Spark Dev list? I would be glad to co-sign, and I can also draft up a quick email if you don't have time.
On Thu, Aug 15, 2024 at 10:04 AM Micah Kornfield <emkornfi...@gmail.com> wrote:

>> I agree that it would be beneficial to make a sub-project, the main problem is political and not logistic. I've been asking for movement from other relative projects for a month and we simply haven't gotten anywhere.
>
> I just wanted to double check that these issues were brought directly to the spark community (i.e. a discussion thread on the Spark developer mailing list) and not via backchannels.
>
> I'm not sure the outcome would be different and I don't think this should block forking the spec, but we should make sure that the decision is publicly documented within both communities.
>
> Thanks,
> Micah
>
> On Thu, Aug 15, 2024 at 7:47 AM Russell Spitzer <russell.spit...@gmail.com> wrote:
>
>> @Gang Wu
>>
>> I agree that it would be beneficial to make a sub-project, the main problem is political and not logistic. I've been asking for movement from other relative projects for a month and we simply haven't gotten anywhere. I don't think there is anything that would stop us from moving to a joint project in the future and if you know of some way of encouraging that movement from other relevant parties I would be glad to collaborate in doing that. One thing that I don't want to do is have the Iceberg project stay in a holding pattern without any clear roadmap as to how to proceed.
>>
>> On Wed, Aug 14, 2024 at 11:12 PM Yufei Gu <flyrain...@gmail.com> wrote:
>>
>>> I’m on board with copying the spec into our repository. However, as we’ve talked about, it’s not just a straightforward copy; there are already some divergences. Some of them are under discussion. Iceberg is definitely the best place for these specs. Engines like Trino and Flink can then rely on the Iceberg specs as a solid foundation.
>>>
>>> Yufei
>>>
>>> On Wed, Aug 14, 2024 at 7:51 PM Gang Wu <ust...@gmail.com> wrote:
>>>
>>>> Sorry for chiming in late.
>>>>
>>>> From the discussion in https://lists.apache.org/thread/xcyytoypgplfr74klg1z2rgjo6k5b0sq, I don't quite understand why it is logistically complicated to create a sub-project to hold the variant spec and impl.
>>>>
>>>> IMHO, copying the variant type spec into Apache Iceberg has some deficiencies:
>>>> - It is a burden to update two repos if there is a variant type spec change and will likely result in deviation if some changes do not reach agreement from both parties.
>>>> - Implementers are required to keep an eye on both specs (considering proprietary engines where both Iceberg and Delta are supported).
>>>> - Putting the spec and impl of variant type in Iceberg repo does lose the opportunity for better native support from file formats like Parquet and ORC.
>>>>
>>>> I'm not sure if it is possible to create a separate project (e.g. apache/variant-type) to make it a single point of truth. We can learn from the experience of Apache Arrow. In this fashion, different engines, table formats and file formats can follow the same spec and are free to depend on the reference implementations from apache/variant-type or implement their own.
>>>>
>>>> Best,
>>>> Gang
>>>>
>>>> On Thu, Aug 15, 2024 at 10:07 AM Jack Ye <yezhao...@gmail.com> wrote:
>>>>
>>>>> +1 for copying the spec into our repository, I think we need to own it fully as a part of the table spec, and we can build compatibility through tests.
>>>>>
>>>>> -Jack
>>>>>
>>>>> On Wed, Aug 14, 2024 at 12:52 PM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>>>>
>>>>>> I'm not really in favor of linking and annotating as that just makes things more complicated and still is essentially forking just with more steps.
>>>>>> If we just track our annotations / modifications to a single commit/version then we have the same issue again but now you have to go to multiple sources to get the actual Spec. *In addition, our very copy of the Spec is going to require new types which don't exist in the Spark Spec which necessarily means diverging.* We will need to take up new primitive id's (as noted in my first email)
>>>>>>
>>>>>> The other issue I have is I don't think the Spark Spec is really going through a thorough review process from all members of the Spark community, I believe it probably should have gone through the SPIP but instead seems to have been merged without broad community involvement.
>>>>>>
>>>>>> The only way to truly avoid diverging is to only have a single copy of the spec, in our previous discussions the vast majority of Apache Iceberg community want it to exist here.
>>>>>>
>>>>>> On Wed, Aug 14, 2024 at 2:19 PM Daniel Weeks <dwe...@apache.org> wrote:
>>>>>>
>>>>>>> I'm really excited about the introduction of variant type to Iceberg, but I want to raise concerns about forking the spec.
>>>>>>>
>>>>>>> I feel like preemptively forking would create the situation where we end up diverging because there's little reason to work with both communities to evolve in a way that benefits everyone.
>>>>>>>
>>>>>>> I would much rather point to a specific version of the spec and annotate any variance in Iceberg's handling. This would allow us to continue without dividing the communities.
>>>>>>>
>>>>>>> If at any point there are irreconcilable differences, I would support forking, but I don't feel like that should be the initial step.
>>>>>>>
>>>>>>> No one is excited about the possibility that the physical representations end up diverging, but it feels like we're setting ourselves up for that exact scenario.
>>>>>>>
>>>>>>> -Dan
>>>>>>>
>>>>>>> On Wed, Aug 14, 2024 at 6:54 AM Fokko Driesprong <fo...@apache.org> wrote:
>>>>>>>
>>>>>>>> +1 to what's already being said here. It is good to copy the spec to Iceberg and add context that's specific to Iceberg, but at the same time, we should maintain compatibility.
>>>>>>>>
>>>>>>>> Kind regards,
>>>>>>>> Fokko
>>>>>>>>
>>>>>>>> On Wed, Aug 14, 2024 at 3:30 PM Manu Zhang <owenzhang1...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> +1 to copy the spec into our repository. I think the best way to keep compatibility is building integration tests.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Manu
>>>>>>>>>
>>>>>>>>> On Wed, Aug 14, 2024 at 8:27 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks Russell and Aihua for pushing Variant support!
>>>>>>>>>>
>>>>>>>>>> Given the differences between the supported types and the lack of interest from the other project, I think it is reasonable to duplicate the specification to our repository.
>>>>>>>>>> I would give very strong emphasis on sticking to the Spark spec as much as possible, to keep compatibility as much as possible. Maybe even revert to a shared specification if the situation changes.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Peter
>>>>>>>>>>
>>>>>>>>>> On Tue, Aug 13, 2024 at 7:52 PM Aihua Xu <aihu...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Thanks Russell for bringing this up.
>>>>>>>>>>>
>>>>>>>>>>> This is the main blocker to move forward with the Variant support in Iceberg and hopefully we can have a consensus. To me, I also feel it makes more sense to move the spec into Iceberg rather than Spark engine owns it and we try to keep it compatible with Spark spec.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Aihua
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Aug 12, 2024 at 6:50 PM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Y’all,
>>>>>>>>>>>>
>>>>>>>>>>>> We’ve hit a bit of a roadblock with the Variant Proposal, while we were hoping to move the Variant and Shredding specifications from Spark into Iceberg there doesn’t seem to be a lot of interest in that. Unfortunately, I think we have a number of issues with just linking to the Spark project directly from within Iceberg and *I believe we need to copy the specifications into our repository*.
>>>>>>>>>>>>
>>>>>>>>>>>> There are a few reasons why I think this is necessary:
>>>>>>>>>>>>
>>>>>>>>>>>> First, we have a divergence of types already. The Spark Specification already includes types which Iceberg has no definition for (19, 20 <https://github.com/apache/spark/blob/master/common/variant/README.md#encoding-types> - Interval Types) and Iceberg already has a type which is not included within the Spark Specification (Time) and will soon have more with TimestampNS, and Geo.
>>>>>>>>>>>>
>>>>>>>>>>>> Second, We would like to make sure that Spark is not a hard dependency for other engines. We are working with several implementers of the Iceberg spec and it has previously been agreed that it would be best if the source of truth for Variant existed in an engine and file format neutral location. The Iceberg project has a good open model of governance and, as we have seen so far discussing Variant <https://lists.apache.org/thread/xcyytoypgplfr74klg1z2rgjo6k5b0sq>, open and active collaboration.
>>>>>>>>>>>> This would also help as we can strictly version our changes in-line with the rest of the Iceberg spec.
>>>>>>>>>>>>
>>>>>>>>>>>> Third, The Shredding spec is not quite finished and requires some group analysis and discussion before we commit it. I think again the Iceberg community is probably the right place for this to happen as we have already started discussions here on these topics.
>>>>>>>>>>>>
>>>>>>>>>>>> For these reasons I think we should go with a direct copy of the existing specification from the Spark Project and move ahead with our discussions and modifications within Iceberg. That said, *I do not want to diverge if possible from the Spark proposal*. For example, although we do not use the Interval types above, I think we should not reuse those type ids within our spec. Iceberg's Variant Spec types 19 and 20 would remain unused along with any other types we think are not applicable. We should strive whenever possible to allow for compatibility.
>>>>>>>>>>>>
>>>>>>>>>>>> In the interest of moving forward with this proposal I am hoping to see if anyone in the community objects to this plan going forward or has a better alternative.
>>>>>>>>>>>>
>>>>>>>>>>>> As always I am thankful for your time and am eager to hear back from everyone,
>>>>>>>>>>>> Russ