That's fair @Micah. So far all the discussions have been direct and off the
dev list. Would you like to make the request on the public Spark dev list?
I would be glad to co-sign, and I can also draft a quick email if you don't
have time.

On Thu, Aug 15, 2024 at 10:04 AM Micah Kornfield <emkornfi...@gmail.com>
wrote:

>> I agree that it would be beneficial to make a sub-project; the main
>> problem is political, not logistical. I've been asking for movement from
>> other relevant projects for a month and we simply haven't gotten anywhere.
>
>
> I just wanted to double check that these issues were brought directly to
> the spark community (i.e. a discussion thread on the Spark developer
> mailing list) and not via backchannels.
>
> I'm not sure the outcome would be different and I don't think this should
> block forking the spec, but we should make sure that the decision is
> publicly documented within both communities.
>
> Thanks,
> Micah
>
> On Thu, Aug 15, 2024 at 7:47 AM Russell Spitzer <russell.spit...@gmail.com>
> wrote:
>
>> @Gang Wu
>>
>> I agree that it would be beneficial to make a sub-project; the main
>> problem is political, not logistical. I've been asking for movement from
>> other relevant projects for a month and we simply haven't gotten anywhere.
>> I don't think there is anything that would stop us from moving to a joint
>> project in the future and if you know of some way of encouraging that
>> movement from other relevant parties I would be glad to collaborate in
>> doing that. One thing that I don't want to do is have the Iceberg project
>> stay in a holding pattern without any clear roadmap as to how to proceed.
>>
>> On Wed, Aug 14, 2024 at 11:12 PM Yufei Gu <flyrain...@gmail.com> wrote:
>>
>>> I’m on board with copying the spec into our repository. However, as
>>> we’ve talked about, it’s not just a straightforward copy—there are already
>>> some divergences. Some of them are under discussion. Iceberg is definitely
>>> the best place for these specs. Engines like Trino and Flink can then rely
>>> on the Iceberg specs as a solid foundation.
>>>
>>> Yufei
>>>
>>> On Wed, Aug 14, 2024 at 7:51 PM Gang Wu <ust...@gmail.com> wrote:
>>>
>>>> Sorry for chiming in late.
>>>>
>>>> From the discussion in
>>>> https://lists.apache.org/thread/xcyytoypgplfr74klg1z2rgjo6k5b0sq, I
>>>> don't quite understand why it is logistically complicated to create a
>>>> sub-project to hold the variant spec and impl.
>>>>
>>>> IMHO, copying the variant type spec into Apache Iceberg has some
>>>> deficiencies:
>>>> - It is a burden to update two repos if there is a variant type spec
>>>> change and will likely result in deviation if some changes do not reach
>>>> agreement from both parties.
>>>> - Implementers are required to keep an eye on both specs (considering
>>>> proprietary engines where both Iceberg and Delta are supported).
>>>> - Putting the spec and impl of variant type in Iceberg repo does lose
>>>> the opportunity for better native support from file formats like Parquet
>>>> and ORC.
>>>>
>>>> I'm not sure if it is possible to create a separate project (e.g.
>>>> apache/variant-type) to make it a single point of truth. We can learn from
>>>> the experience of Apache Arrow. In this fashion, different engines, table
>>>> formats and file formats can follow the same spec and are free to depend on
>>>> the reference implementations from apache/variant-type or implement their
>>>> own.
>>>>
>>>> Best,
>>>> Gang
>>>>
>>>>
>>>>
>>>>
>>>> On Thu, Aug 15, 2024 at 10:07 AM Jack Ye <yezhao...@gmail.com> wrote:
>>>>
>>>>> +1 for copying the spec into our repository. I think we need to own it
>>>>> fully as a part of the table spec, and we can build compatibility through
>>>>> tests.
>>>>>
>>>>> -Jack
>>>>>
>>>>> On Wed, Aug 14, 2024 at 12:52 PM Russell Spitzer <
>>>>> russell.spit...@gmail.com> wrote:
>>>>>
>>>>>> I'm not really in favor of linking and annotating, as that just makes
>>>>>> things more complicated and is still essentially forking, just with more
>>>>>> steps. If we just track our annotations / modifications to a single
>>>>>> commit/version then we have the same issue again, but now you have to go
>>>>>> to multiple sources to get the actual Spec. *In addition, our copy of
>>>>>> the Spec is going to require new types which don't exist in the Spark
>>>>>> Spec, which necessarily means diverging.* We will need to take up new
>>>>>> primitive IDs (as noted in my first email).
>>>>>>
>>>>>> The other issue I have is that I don't think the Spark Spec is really
>>>>>> going through a thorough review process from all members of the Spark
>>>>>> community. I believe it probably should have gone through the SPIP
>>>>>> process, but instead it seems to have been merged without broad community
>>>>>> involvement.
>>>>>>
>>>>>> The only way to truly avoid diverging is to have only a single copy
>>>>>> of the spec, and in our previous discussions the vast majority of the
>>>>>> Apache Iceberg community wants it to exist here.
>>>>>>
>>>>>> On Wed, Aug 14, 2024 at 2:19 PM Daniel Weeks <dwe...@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>>> I'm really excited about the introduction of variant type to
>>>>>>> Iceberg, but I want to raise concerns about forking the spec.
>>>>>>>
>>>>>>> I feel like preemptively forking would create the situation where we
>>>>>>> end up diverging because there's little reason to work with both
>>>>>>> communities to evolve in a way that benefits everyone.
>>>>>>>
>>>>>>> I would much rather point to a specific version of the spec and
>>>>>>> annotate any variance in Iceberg's handling.  This would allow us to
>>>>>>> continue without dividing the communities.
>>>>>>>
>>>>>>> If at any point there are irreconcilable differences, I would
>>>>>>> support forking, but I don't feel like that should be the initial step.
>>>>>>>
>>>>>>> No one is excited about the possibility that the physical
>>>>>>> representations end up diverging, but it feels like we're setting
>>>>>>> ourselves up for that exact scenario.
>>>>>>>
>>>>>>> -Dan
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Aug 14, 2024 at 6:54 AM Fokko Driesprong <fo...@apache.org>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> +1 to what's already being said here. It is good to copy the spec
>>>>>>>> to Iceberg and add context that's specific to Iceberg, but at the same
>>>>>>>> time, we should maintain compatibility.
>>>>>>>>
>>>>>>>> Kind regards,
>>>>>>>> Fokko
>>>>>>>>
>>>>>>>> On Wed, Aug 14, 2024 at 3:30 PM Manu Zhang <
>>>>>>>> owenzhang1...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> +1 to copy the spec into our repository. I think the best way to
>>>>>>>>> keep compatibility is building integration tests.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Manu
>>>>>>>>>
>>>>>>>>> On Wed, Aug 14, 2024 at 8:27 PM Péter Váry <
>>>>>>>>> peter.vary.apa...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks Russell and Aihua for pushing Variant support!
>>>>>>>>>>
>>>>>>>>>> Given the differences between the supported types and the lack of
>>>>>>>>>> interest from the other project, I think it is reasonable to
>>>>>>>>>> duplicate the specification to our repository.
>>>>>>>>>> I would put very strong emphasis on sticking to the Spark spec
>>>>>>>>>> as much as possible to preserve compatibility, and maybe even
>>>>>>>>>> revert to a shared specification if the situation changes.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Peter
>>>>>>>>>>
>>>>>>>>>> On Tue, Aug 13, 2024 at 7:52 PM Aihua Xu <aihu...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Thanks Russell for bringing this up.
>>>>>>>>>>>
>>>>>>>>>>> This is the main blocker to moving forward with Variant support
>>>>>>>>>>> in Iceberg, and hopefully we can reach a consensus. To me, it also
>>>>>>>>>>> makes more sense to move the spec into Iceberg rather than have the
>>>>>>>>>>> Spark engine own it while we try to stay compatible with the Spark
>>>>>>>>>>> spec.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Aihua
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Aug 12, 2024 at 6:50 PM Russell Spitzer <
>>>>>>>>>>> russell.spit...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Y’all,
>>>>>>>>>>>>
>>>>>>>>>>>> We’ve hit a bit of a roadblock with the Variant Proposal. While
>>>>>>>>>>>> we were hoping to move the Variant and Shredding specifications
>>>>>>>>>>>> from Spark into Iceberg, there doesn’t seem to be a lot of interest
>>>>>>>>>>>> in that. Unfortunately, I think we have a number of issues with just
>>>>>>>>>>>> linking to the Spark project directly from within Iceberg and *I
>>>>>>>>>>>> believe we need to copy the specifications into our repository*.
>>>>>>>>>>>>
>>>>>>>>>>>> There are a few reasons why I think this is necessary:
>>>>>>>>>>>>
>>>>>>>>>>>> First, we have a divergence of types already. The Spark
>>>>>>>>>>>> Specification already includes types which Iceberg has no
>>>>>>>>>>>> definition for (19, 20
>>>>>>>>>>>> <https://github.com/apache/spark/blob/master/common/variant/README.md#encoding-types>
>>>>>>>>>>>> - Interval Types) and Iceberg already has a type which is not
>>>>>>>>>>>> included within the Spark Specification (Time) and will soon have
>>>>>>>>>>>> more with TimestampNS and Geo.
>>>>>>>>>>>>
>>>>>>>>>>>> Second, we would like to make sure that Spark is not a hard
>>>>>>>>>>>> dependency for other engines. We are working with several
>>>>>>>>>>>> implementers of the Iceberg spec and it has previously been agreed
>>>>>>>>>>>> that it would be best if the source of truth for Variant existed in
>>>>>>>>>>>> an engine- and file-format-neutral location. The Iceberg project has
>>>>>>>>>>>> a good open model of governance and, as we have seen so far
>>>>>>>>>>>> discussing Variant
>>>>>>>>>>>> <https://lists.apache.org/thread/xcyytoypgplfr74klg1z2rgjo6k5b0sq>,
>>>>>>>>>>>> open and active collaboration. This would also help as we can
>>>>>>>>>>>> strictly version our changes in line with the rest of the Iceberg
>>>>>>>>>>>> spec.
>>>>>>>>>>>>
>>>>>>>>>>>> Third, the Shredding spec is not quite finished and requires
>>>>>>>>>>>> some group analysis and discussion before we commit it. Again, I
>>>>>>>>>>>> think the Iceberg community is probably the right place for this to
>>>>>>>>>>>> happen, as we have already started discussions here on these topics.
>>>>>>>>>>>>
>>>>>>>>>>>> For these reasons I think we should go with a direct copy of
>>>>>>>>>>>> the existing specification from the Spark Project and move ahead
>>>>>>>>>>>> with our discussions and modifications within Iceberg. That said, *I
>>>>>>>>>>>> do not want to diverge from the Spark proposal if possible*. For
>>>>>>>>>>>> example, although we do not use the Interval types above, I think we
>>>>>>>>>>>> should not reuse those type ids within our spec. Iceberg's Variant
>>>>>>>>>>>> Spec types 19 and 20 would remain unused along with any other types
>>>>>>>>>>>> we think are not applicable. We should strive whenever possible to
>>>>>>>>>>>> allow for compatibility.
>>>>>>>>>>>>
>>>>>>>>>>>> In the interest of moving forward with this proposal, I am
>>>>>>>>>>>> hoping to see if anyone in the community objects to this plan going
>>>>>>>>>>>> forward or has a better alternative.
>>>>>>>>>>>>
>>>>>>>>>>>> As always I am thankful for your time and am eager to hear back
>>>>>>>>>>>> from everyone,
>>>>>>>>>>>> Russ
>>>>>>>>>>>>
>>>>>>>>>>>>
