I support that whole-heartedly. Parquet would be a great neutral location for the spec.
On Thu, Aug 15, 2024 at 1:17 PM Ryan Blue <b...@databricks.com.invalid> wrote:
> I think it's a good idea to reach out to the Spark community and make sure
> we are in agreement. Up until now I think we've been thinking more
> abstractly about what makes sense but before we make any decision we should
> definitely collaborate with the other communities.
>
> I'd also like to suggest an alternative for where this spec should be
> maintained that would hopefully allow us to avoid copying and maintaining
> multiple places. As we've already discussed, this is not an easy spec to
> find a home for because there are alternative projects that are all
> interested. Since this is a cross-engine type, Spark may not be ideal. At
> the same time, Delta already supports the variant spec so there's a similar
> problem maintaining this in Iceberg.
>
> I think that a reasonable and neutral option is to see if the Parquet
> community would be willing to host the spec and library. That fits with the
> spec because subcolumnarization is written assuming Parquet is the storage.
> It would also be the best place for broad compatibility because anyone
> using Parquet would have a strong motivation to standardize on the same
> encoding.
>
> Initially, I pushed for Iceberg instead of Parquet because we may want to
> have the same variant encoding in ORC, but what made me change my mind is
> that every layer (file format, table format, engine) has that problem and
> I've heard the concern about neutrality raised multiple times while
> discussing this question internally.
>
> I think the Parquet community is the most neutral option available. Would
> anyone else support asking the Spark and Parquet communities to maintain
> the variant spec in Parquet?
>
> Ryan
>
> On Thu, Aug 15, 2024 at 8:34 AM Xuanwo <xua...@apache.org> wrote:
>
>> From the iceberg-rust perspective, it could be extremely challenging to
>> keep track of both the Spark and Iceberg specifications.
>> Having a single source of truth would be much better. I believe this
>> change will also benefit Delta Lake if they implement the same approach.
>> Perhaps we can try contacting them to initiate such a project?
>>
>> On Thu, Aug 15, 2024, at 23:17, Gang Wu wrote:
>>
>> +1 on posting this discussion to dev@spark ML
>>
>> > I don't think there is anything that would stop us from moving to a
>> > joint project in the future
>>
>> My concern is that if we don't do this from day 1, we will never ever do
>> this.
>>
>> Best,
>> Gang
>>
>> On Thu, Aug 15, 2024 at 11:08 PM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>
>> That's fair @Micah, so far all the discussions have been direct and off
>> the dev list. Would you like to make the request on the public Spark Dev
>> list? I would be glad to co-sign, I can also draft up a quick email if you
>> don't have time.
>>
>> On Thu, Aug 15, 2024 at 10:04 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
>>
>> > I agree that it would be beneficial to make a sub-project, the main
>> > problem is political and not logistic. I've been asking for movement from
>> > other relevant projects for a month and we simply haven't gotten anywhere.
>>
>> I just wanted to double check that these issues were brought directly to
>> the Spark community (i.e. a discussion thread on the Spark developer
>> mailing list) and not via backchannels.
>>
>> I'm not sure the outcome would be different and I don't think this should
>> block forking the spec, but we should make sure that the decision is
>> publicly documented within both communities.
>>
>> Thanks,
>> Micah
>>
>> On Thu, Aug 15, 2024 at 7:47 AM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>
>> @Gang Wu
>> I agree that it would be beneficial to make a sub-project, the main
>> problem is political and not logistic. I've been asking for movement from
>> other relevant projects for a month and we simply haven't gotten anywhere.
>> I don't think there is anything that would stop us from moving to a joint
>> project in the future and if you know of some way of encouraging that
>> movement from other relevant parties I would be glad to collaborate in
>> doing that. One thing that I don't want to do is have the Iceberg project
>> stay in a holding pattern without any clear roadmap as to how to proceed.
>>
>> On Wed, Aug 14, 2024 at 11:12 PM Yufei Gu <flyrain...@gmail.com> wrote:
>>
>> I’m on board with copying the spec into our repository. However, as we’ve
>> talked about, it’s not just a straightforward copy: there are already some
>> divergences. Some of them are under discussion. Iceberg is definitely the
>> best place for these specs. Engines like Trino and Flink can then rely on
>> the Iceberg specs as a solid foundation.
>>
>> Yufei
>>
>> On Wed, Aug 14, 2024 at 7:51 PM Gang Wu <ust...@gmail.com> wrote:
>>
>> Sorry for chiming in late.
>>
>> From the discussion in
>> https://lists.apache.org/thread/xcyytoypgplfr74klg1z2rgjo6k5b0sq, I
>> don't quite understand why it is logistically complicated to create a
>> sub-project to hold the variant spec and impl.
>>
>> IMHO, copying the variant type spec into Apache Iceberg has some
>> deficiencies:
>> - It is a burden to update two repos if there is a variant type spec
>>   change and will likely result in deviation if some changes do not reach
>>   agreement from both parties.
>> - Implementers are required to keep an eye on both specs (considering
>>   proprietary engines where both Iceberg and Delta are supported).
>> - Putting the spec and impl of variant type in Iceberg repo does lose the
>>   opportunity for better native support from file formats like Parquet and
>>   ORC.
>>
>> I'm not sure if it is possible to create a separate project (e.g.
>> apache/variant-type) to make it a single point of truth. We can learn from
>> the experience of Apache Arrow.
>> In this fashion, different engines, table formats and file formats can
>> follow the same spec and are free to depend on the reference
>> implementations from apache/variant-type or implement their own.
>>
>> Best,
>> Gang
>>
>> On Thu, Aug 15, 2024 at 10:07 AM Jack Ye <yezhao...@gmail.com> wrote:
>>
>> +1 for copying the spec into our repository, I think we need to own it
>> fully as a part of the table spec, and we can build compatibility through
>> tests.
>>
>> -Jack
>>
>> On Wed, Aug 14, 2024 at 12:52 PM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>
>> I'm not really in favor of linking and annotating as that just makes
>> things more complicated and still is essentially forking just with more
>> steps. If we just track our annotations / modifications to a single
>> commit/version then we have the same issue again but now you have to go to
>> multiple sources to get the actual Spec. *In addition, our very copy of
>> the Spec is going to require new types which don't exist in the Spark Spec
>> which necessarily means diverging.* We will need to take up new
>> primitive IDs (as noted in my first email).
>>
>> The other issue I have is I don't think the Spark Spec is really going
>> through a thorough review process from all members of the Spark community.
>> I believe it probably should have gone through the SPIP but instead seems
>> to have been merged without broad community involvement.
>>
>> The only way to truly avoid diverging is to only have a single copy of
>> the spec; in our previous discussions the vast majority of the Apache
>> Iceberg community wanted it to exist here.
>>
>> On Wed, Aug 14, 2024 at 2:19 PM Daniel Weeks <dwe...@apache.org> wrote:
>>
>> I'm really excited about the introduction of variant type to Iceberg, but
>> I want to raise concerns about forking the spec.
>>
>> I feel like preemptively forking would create the situation where we end
>> up diverging because there's little reason to work with both communities to
>> evolve in a way that benefits everyone.
>>
>> I would much rather point to a specific version of the spec and annotate
>> any variance in Iceberg's handling. This would allow us to continue
>> without dividing the communities.
>>
>> If at any point there are irreconcilable differences, I would support
>> forking, but I don't feel like that should be the initial step.
>>
>> No one is excited about the possibility that the physical representations
>> end up diverging, but it feels like we're setting ourselves up for that
>> exact scenario.
>>
>> -Dan
>>
>> On Wed, Aug 14, 2024 at 6:54 AM Fokko Driesprong <fo...@apache.org> wrote:
>>
>> +1 to what's already being said here. It is good to copy the spec to
>> Iceberg and add context that's specific to Iceberg, but at the same time,
>> we should maintain compatibility.
>>
>> Kind regards,
>> Fokko
>>
>> On Wed, Aug 14, 2024 at 15:30, Manu Zhang <owenzhang1...@gmail.com> wrote:
>>
>> +1 to copy the spec into our repository. I think the best way to keep
>> compatibility is building integration tests.
>>
>> Thanks,
>> Manu
>>
>> On Wed, Aug 14, 2024 at 8:27 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>
>> Thanks Russell and Aihua for pushing Variant support!
>>
>> Given the differences between the supported types and the lack of
>> interest from the other project, I think it is reasonable to duplicate the
>> specification to our repository.
>> I would give very strong emphasis on sticking to the Spark spec as much
>> as possible, to keep compatibility as much as possible. Maybe even revert
>> to a shared specification if the situation changes.
>>
>> Thanks,
>> Peter
>>
>> On Tue, Aug 13, 2024 at 19:52, Aihua Xu <aihu...@gmail.com> wrote:
>>
>> Thanks Russell for bringing this up.
>>
>> This is the main blocker to move forward with the Variant support in
>> Iceberg and hopefully we can have a consensus. To me, I also feel it makes
>> more sense to move the spec into Iceberg rather than have the Spark engine
>> own it while we try to keep it compatible with the Spark spec.
>>
>> Thanks,
>> Aihua
>>
>> On Mon, Aug 12, 2024 at 6:50 PM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>
>> Hi Y’all,
>>
>> We’ve hit a bit of a roadblock with the Variant Proposal. While we were
>> hoping to move the Variant and Shredding specifications from Spark into
>> Iceberg, there doesn’t seem to be a lot of interest in that. Unfortunately,
>> I think we have a number of issues with just linking to the Spark project
>> directly from within Iceberg and *I believe we need to copy the
>> specifications into our repository*.
>>
>> There are a few reasons why I think this is necessary.
>>
>> First, we have a divergence of types already. The Spark Specification
>> already includes types which Iceberg has no definition for (19, 20
>> <https://github.com/apache/spark/blob/master/common/variant/README.md#encoding-types>
>> - Interval Types) and Iceberg already has a type which is not included
>> within the Spark Specification (Time) and will soon have more with
>> TimestampNS and Geo.
>>
>> Second, we would like to make sure that Spark is not a hard dependency
>> for other engines. We are working with several implementers of the Iceberg
>> spec and it has previously been agreed that it would be best if the source
>> of truth for Variant existed in an engine and file format neutral location.
>> The Iceberg project has a good open model of governance and, as we have
>> seen so far discussing Variant
>> <https://lists.apache.org/thread/xcyytoypgplfr74klg1z2rgjo6k5b0sq>, open
>> and active collaboration. This would also help as we can strictly version
>> our changes in line with the rest of the Iceberg spec.
>>
>> Third, the Shredding spec is not quite finished and requires some group
>> analysis and discussion before we commit it. I think again the Iceberg
>> community is probably the right place for this to happen as we have already
>> started discussions here on these topics.
>>
>> For these reasons I think we should go with a direct copy of the existing
>> specification from the Spark Project and move ahead with our discussions
>> and modifications within Iceberg. That said, *I do not want to diverge
>> if possible from the Spark proposal*. For example, although we do not
>> use the Interval types above, I think we should *not* reuse those type
>> ids within our spec. Iceberg's Variant Spec types 19 and 20 would remain
>> unused along with any other types we think are not applicable. We should
>> strive whenever possible to allow for compatibility.
>>
>> In the interest of moving forward with this proposal I am hoping to see
>> if anyone in the community objects to this plan going forward or has a
>> better alternative.
>>
>> As always I am thankful for your time and am eager to hear back from
>> everyone,
>> Russ
>>
>> Xuanwo
>>
>> https://xuanwo.io/
>>
>
> --
> Ryan Blue
> Databricks
>