I personally believe Arrow is a better choice since we will eventually have the same memory layout but different physical layouts in Parquet, ORC, or other file formats.

One concern I have about this option is whether the Arrow community is willing to make this happen and maintain the specification.

> Should we start a vote to move forward?

I believe a vote to proceed makes sense since we still haven't reached a consensus on this point.

On Thu, Aug 22, 2024, at 14:32, Gang Wu wrote:
> It seems that we have reached a consensus to some extent that there
> should be a new home for the variant spec. The pending question
> is whether Parquet or Arrow is a better choice. As a committer from Arrow,
> Parquet and ORC communities, I am neutral to choose any and happy to
> help with the movement once a decision has been made.
>
> Should we start a vote to move forward?
>
> Best,
> Gang
>
> On Sat, Aug 17, 2024 at 8:34 AM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
>
>> > That being said, I think the most important consideration for now is where
>> > are the current maintainers / contributors to the variant type. If most of
>> > them are already PMC members / committers on a project, it becomes a bit
>> > easier. Otherwise if there isn't much overlap with a project's existing
>> > governance, I worry there could be a bit of friction. How many active
>> > contributors are there from Iceberg? And how about from Arrow?
>>
>>
>> I think this is the key question. What are the requirements around
>> governance? I've seen some tangential messaging here but I'm not clear on
>> what everyone expects.
>>
>> I think for a lot of the other concerns my view is that the exact project
>> does not really matter (and choosing a project with mature cross language
>> testing infrastructure or committing to building it is critical). IIUC we
>> are talking about following artifacts:
>>
>> 1. A stand alone specification document (this can be hosted anyplace)
>> 2. A set of language bindings with minimal dependencies can be consumed
>> downstream (again, as long as dependencies are managed carefully any
>> project can host these)
>> 3.
Potential integration where appropriate into file format libraries to
>> support shredding (but as of now this is being bypassed by using
>> conventions anyways). My impression is that at least for Parquet there has
>> been a proliferation of vectorized readers across different projects, so
>> I'm not clear how much standardization in parquet-java could help here.
>>
>> To respond to some other questions:
>>
>> > Arrow is not used as Spark's in-memory model, nor Trino and others so those
>> > existing relationships aren't there. I also worry that differences in
>> > approaches would make it difficult later on.
>>
>>
>> While Arrow is not in the core memory model, for Spark I believe it is
>> still used for IPC for things like Java<->Python. Trino also consumes Arrow
>> libraries today to support things like Snowflake/Bigquery federation. But I
>> think this is minor because as mentioned above I think the functional
>> libraries would be relatively stand-alone.
>>
>> > Do we think it could be introduced as a canonical extension arrow type?
>>
>>
>> I believe it can be, I think there are probably different layouts that can
>> be supported:
>>
>> 1. A struct with two variable width bytes columns (metadata and value data
>> are stored separately and each entry has a 1:1 relationship).
>> 2. Shredded (shredded according to the same convention as parquet), I
>> would need to double check but I don't think Arrow would have problems here
>> but REE would likely be required to make this efficient (i.e. sparse value
>> support is important).
>>
>> In both cases the main complexity is providing the necessary functions for
>> manipulation.
>>
>> Thanks,
>> Micah
>>
>> On Fri, Aug 16, 2024 at 3:58 PM Will Jones <will.jones...@gmail.com>
>> wrote:
>>
>> > In being more engine and format agnostic, I agree the Arrow project might
>> > be a good host for such a specification. It seems like we want to move away
>> > from hosting in Spark to make it engine agnostic.
But moving into Iceberg
>> > might make it less format agnostic, as I understand multiple formats might
>> > want to implement this. I'm not intimately familiar with the state of this,
>> > but I believe Delta Lake would like to be aligned with the same format as
>> > Iceberg. In addition, the Lance format (which I work on), will eventually
>> > be interesting as well. It seems equally bad to me to attach this
>> > specification to a particular table format as it does a particular query
>> > engine.
>> >
>> > That being said, I think the most important consideration for now is where
>> > are the current maintainers / contributors to the variant type. If most of
>> > them are already PMC members / committers on a project, it becomes a bit
>> > easier. Otherwise if there isn't much overlap with a project's existing
>> > governance, I worry there could be a bit of friction. How many active
>> > contributors are there from Iceberg? And how about from Arrow?
>> >
>> > BTW, I'd add I'm interested in helping develop an Arrow extension type for
>> > the binary variant type. I've been experimenting with a DataFusion
>> > extension that operates on this [1], and already have some ideas on how
>> > such an extension type might be defined. I'm not yet caught up on the
>> > shredded specification, but I think having just the binary format would be
>> > beneficial for in-memory analytics, which are most relevant to Arrow. I'll
>> > be creating a separate thread on the Arrow ML about this soon.
>> >
>> > Best,
>> >
>> > Will Jones
>> >
>> > [1] https://github.com/datafusion-contrib/datafusion-functions-variant/issues
>> >
>> >
>> > On Thu, Aug 15, 2024 at 7:39 PM Gang Wu <ust...@gmail.com> wrote:
>> >
>> > > + dev@arrow
>> > >
>> > > Thanks for all the valuable suggestions! I am inclined to Micah's idea that
>> > > Arrow might be a better host compared to Parquet.
>> > >
>> > > To give more context, I am taking the initiative to add the geometry type
>> > > to both Parquet and ORC. I'd like to do the same thing for variant type in
>> > > that variant type is engine and file format agnostic. This does mean that
>> > > Parquet might not be the neutral place to hold the variant spec.
>> > >
>> > > Best,
>> > > Gang
>> > >
>> > > On Fri, Aug 16, 2024 at 10:00 AM Jingsong Li <jingsongl...@gmail.com>
>> > > wrote:
>> > >
>> > > > Thanks all for your discussion.
>> > > >
>> > > > The Apache Paimon community is also considering support for this
>> > > > Variant type, without a doubt, we hope to maintain consistency with
>> > > > Iceberg.
>> > > >
>> > > > Not only the Paimon community, but also various computing engines need
>> > > > to adapt to this type, such as Flink and StarRocks. We also hope to
>> > > > promote them to adapt to this type.
>> > > >
>> > > > It is worth noting that we also need to standardize many functions
>> > > > related to it.
>> > > >
>> > > > A neutral place to maintain it is a great choice.
>> > > >
>> > > > - As Gang Wu said, a standalone project is good, just like RoaringBitmap
>> > > > [1].
>> > > > - As Ryan said, Parquet community is a neutral option too.
>> > > > - As Micah said, Arrow is also an option too.
>> > > >
>> > > > [1] https://github.com/RoaringBitmap
>> > > >
>> > > > Best,
>> > > > Jingsong
>> > > >
>> > > > On Fri, Aug 16, 2024 at 7:18 AM Micah Kornfield <emkornfi...@gmail.com>
>> > > > wrote:
>> > > > >>
>> > > > >> That's fair @Micah, so far all the discussions have been direct and off the dev list. Would you like to make the request on the public Spark Dev list? I would be glad to co-sign, I can also draft up a quick email if you don't have time.
>> > > > >
>> > > > >
>> > > > > I think once we come to consensus, if you have bandwidth, I think the message might be better coming from you, as you have more context on some of the non-public conversations, the requirements from an Iceberg perspective on governance and the blockers that were encountered. If details on the conversations can't be shared, (i.e. we are starting from scratch) it seems like suggesting a new project via SPIP might be the way forward. I'm happy to help with that if it is useful but I would guess Aihua or Tyler might be in a better place to start as it seems they have done more serious thinking here.
>> > > > >
>> > > > > If we decide to try to standardize on Parquet or Arrow I'm happy to help support the effort in those communities.
>> > > > >
>> > > > > Thanks,
>> > > > > Micah
>> > > > >
>> > > > > On Thu, Aug 15, 2024 at 8:09 AM Russell Spitzer <russell.spit...@gmail.com> wrote:
>> > > > >>
>> > > > >> That's fair @Micah, so far all the discussions have been direct and off the dev list. Would you like to make the request on the public Spark Dev list? I would be glad to co-sign, I can also draft up a quick email if you don't have time.
>> > > > >>
>> > > > >> On Thu, Aug 15, 2024 at 10:04 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
>> > > > >>>>
>> > > > >>>> I agree that it would be beneficial to make a sub-project, the main problem is political and not logistic. I've been asking for movement from other relative projects for a month and we simply haven't gotten anywhere.
>> > > > >>>
>> > > > >>>
>> > > > >>> I just wanted to double check that these issues were brought directly to the Spark community (i.e.
a discussion thread on the Spark developer mailing list) and not via backchannels.
>> > > > >>>
>> > > > >>> I'm not sure the outcome would be different and I don't think this should block forking the spec, but we should make sure that the decision is publicly documented within both communities.
>> > > > >>>
>> > > > >>> Thanks,
>> > > > >>> Micah
>> > > > >>>
>> > > > >>> On Thu, Aug 15, 2024 at 7:47 AM Russell Spitzer <russell.spit...@gmail.com> wrote:
>> > > > >>>>
>> > > > >>>> @Gang Wu
>> > > > >>>>
>> > > > >>>> I agree that it would be beneficial to make a sub-project, the main problem is political and not logistic. I've been asking for movement from other relative projects for a month and we simply haven't gotten anywhere. I don't think there is anything that would stop us from moving to a joint project in the future and if you know of some way of encouraging that movement from other relevant parties I would be glad to collaborate in doing that. One thing that I don't want to do is have the Iceberg project stay in a holding pattern without any clear roadmap as to how to proceed.
>> > > > >>>>
>> > > > >>>> On Wed, Aug 14, 2024 at 11:12 PM Yufei Gu <flyrain...@gmail.com> wrote:
>> > > > >>>>>
>> > > > >>>>> I’m on board with copying the spec into our repository. However, as we’ve talked about, it’s not just a straightforward copy; there are already some divergences. Some of them are under discussion. Iceberg is definitely the best place for these specs. Engines like Trino and Flink can then rely on the Iceberg specs as a solid foundation.
>> > > > >>>>>
>> > > > >>>>> Yufei
>> > > > >>>>>
>> > > > >>>>> On Wed, Aug 14, 2024 at 7:51 PM Gang Wu <ust...@gmail.com> wrote:
>> > > > >>>>>>
>> > > > >>>>>> Sorry for chiming in late.
>> > > > >>>>>>
>> > > > >>>>>> From the discussion in https://lists.apache.org/thread/xcyytoypgplfr74klg1z2rgjo6k5b0sq, I don't quite understand why it is logistically complicated to create a sub-project to hold the variant spec and impl.
>> > > > >>>>>>
>> > > > >>>>>> IMHO, copying the variant type spec into Apache Iceberg has some deficiencies:
>> > > > >>>>>> - It is a burden to update two repos if there is a variant type spec change and will likely result in deviation if some changes do not reach agreement from both parties.
>> > > > >>>>>> - Implementers are required to keep an eye on both specs (considering proprietary engines where both Iceberg and Delta are supported).
>> > > > >>>>>> - Putting the spec and impl of variant type in Iceberg repo does lose the opportunity for better native support from file formats like Parquet and ORC.
>> > > > >>>>>>
>> > > > >>>>>> I'm not sure if it is possible to create a separate project (e.g. apache/variant-type) to make it a single point of truth. We can learn from the experience of Apache Arrow. In this fashion, different engines, table formats and file formats can follow the same spec and are free to depend on the reference implementations from apache/variant-type or implement their own.
>> > > > >>>>>>
>> > > > >>>>>> Best,
>> > > > >>>>>> Gang
>> > > > >>>>>>
>> > > > >>>>>> On Thu, Aug 15, 2024 at 10:07 AM Jack Ye <yezhao...@gmail.com> wrote:
>> > > > >>>>>>>
>> > > > >>>>>>> +1 for copying the spec into our repository, I think we need to own it fully as a part of the table spec, and we can build compatibility through tests.
>> > > > >>>>>>>
>> > > > >>>>>>> -Jack
>> > > > >>>>>>>
>> > > > >>>>>>> On Wed, Aug 14, 2024 at 12:52 PM Russell Spitzer <russell.spit...@gmail.com> wrote:
>> > > > >>>>>>>>
>> > > > >>>>>>>> I'm not really in favor of linking and annotating as that just makes things more complicated and still is essentially forking just with more steps. If we just track our annotations / modifications to a single commit/version then we have the same issue again but now you have to go to multiple sources to get the actual Spec. In addition, our very copy of the Spec is going to require new types which don't exist in the Spark Spec which necessarily means diverging. We will need to take up new primitive id's (as noted in my first email)
>> > > > >>>>>>>>
>> > > > >>>>>>>> The other issue I have is I don't think the Spark Spec is really going through a thorough review process from all members of the Spark community, I believe it probably should have gone through the SPIP but instead seems to have been merged without broad community involvement.
>> > > > >>>>>>>>
>> > > > >>>>>>>> The only way to truly avoid diverging is to only have a single copy of the spec, in our previous discussions the vast majority of Apache Iceberg community want it to exist here.
>> > > > >>>>>>>>
>> > > > >>>>>>>> On Wed, Aug 14, 2024 at 2:19 PM Daniel Weeks <dwe...@apache.org> wrote:
>> > > > >>>>>>>>>
>> > > > >>>>>>>>> I'm really excited about the introduction of variant type to Iceberg, but I want to raise concerns about forking the spec.
>> > > > >>>>>>>>>
>> > > > >>>>>>>>> I feel like preemptively forking would create the situation where we end up diverging because there's little reason to work with both communities to evolve in a way that benefits everyone.
>> > > > >>>>>>>>>
>> > > > >>>>>>>>> I would much rather point to a specific version of the spec and annotate any variance in Iceberg's handling. This would allow us to continue without dividing the communities.
>> > > > >>>>>>>>>
>> > > > >>>>>>>>> If at any point there are irreconcilable differences, I would support forking, but I don't feel like that should be the initial step.
>> > > > >>>>>>>>>
>> > > > >>>>>>>>> No one is excited about the possibility that the physical representations end up diverging, but it feels like we're setting ourselves up for that exact scenario.
>> > > > >>>>>>>>>
>> > > > >>>>>>>>> -Dan
>> > > > >>>>>>>>>
>> > > > >>>>>>>>>
>> > > > >>>>>>>>> On Wed, Aug 14, 2024 at 6:54 AM Fokko Driesprong <fo...@apache.org> wrote:
>> > > > >>>>>>>>>>
>> > > > >>>>>>>>>> +1 to what's already being said here. It is good to copy the spec to Iceberg and add context that's specific to Iceberg, but at the same time, we should maintain compatibility.
>> > > > >>>>>>>>>>
>> > > > >>>>>>>>>> Kind regards,
>> > > > >>>>>>>>>> Fokko
>> > > > >>>>>>>>>>
>> > > > >>>>>>>>>> On Wed, Aug 14, 2024 at 15:30, Manu Zhang <owenzhang1...@gmail.com> wrote:
>> > > > >>>>>>>>>>>
>> > > > >>>>>>>>>>> +1 to copy the spec into our repository. I think the best way to keep compatibility is building integration tests.
>> > > > >>>>>>>>>>>
>> > > > >>>>>>>>>>> Thanks,
>> > > > >>>>>>>>>>> Manu
>> > > > >>>>>>>>>>>
>> > > > >>>>>>>>>>> On Wed, Aug 14, 2024 at 8:27 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>> > > > >>>>>>>>>>>>
>> > > > >>>>>>>>>>>> Thanks Russell and Aihua for pushing Variant support!
>> > > > >>>>>>>>>>>>
>> > > > >>>>>>>>>>>> Given the differences between the supported types and the lack of interest from the other project, I think it is reasonable to duplicate the specification to our repository.
>> > > > >>>>>>>>>>>> I would give very strong emphasis on sticking to the Spark spec as much as possible, to keep compatibility as much as possible. Maybe even revert to a shared specification if the situation changes.
>> > > > >>>>>>>>>>>>
>> > > > >>>>>>>>>>>> Thanks,
>> > > > >>>>>>>>>>>> Peter
>> > > > >>>>>>>>>>>>
>> > > > >>>>>>>>>>>> On Tue, Aug 13, 2024 at 19:52, Aihua Xu <aihu...@gmail.com> wrote:
>> > > > >>>>>>>>>>>>>
>> > > > >>>>>>>>>>>>> Thanks Russell for bringing this up.
>> > > > >>>>>>>>>>>>>
>> > > > >>>>>>>>>>>>> This is the main blocker to move forward with the Variant support in Iceberg and hopefully we can have a consensus. To me, I also feel it makes more sense to move the spec into Iceberg rather than Spark engine owns it and we try to keep it compatible with Spark spec.
>> > > > >>>>>>>>>>>>>
>> > > > >>>>>>>>>>>>> Thanks,
>> > > > >>>>>>>>>>>>> Aihua
>> > > > >>>>>>>>>>>>>
>> > > > >>>>>>>>>>>>> On Mon, Aug 12, 2024 at 6:50 PM Russell Spitzer <russell.spit...@gmail.com> wrote:
>> > > > >>>>>>>>>>>>>>
>> > > > >>>>>>>>>>>>>> Hi Y’all,
>> > > > >>>>>>>>>>>>>>
>> > > > >>>>>>>>>>>>>> We’ve hit a bit of a roadblock with the Variant Proposal, while we were hoping to move the Variant and Shredding specifications from Spark into Iceberg there doesn’t seem to be a lot of interest in that. Unfortunately, I think we have a number of issues with just linking to the Spark project directly from within Iceberg and I believe we need to copy the specifications into our repository.
>> > > > >>>>>>>>>>>>>>
>> > > > >>>>>>>>>>>>>> There are a few reasons why I think this is necessary
>> > > > >>>>>>>>>>>>>>
>> > > > >>>>>>>>>>>>>> First, we have a divergence of types already.
The Spark Specification already includes types which Iceberg has no definition for (19, 20 - Interval Types) and Iceberg already has a type which is not included within the Spark Specification (Time) and will soon have more with TimestampNS, and Geo.
>> > > > >>>>>>>>>>>>>>
>> > > > >>>>>>>>>>>>>> Second, we would like to make sure that Spark is not a hard dependency for other engines. We are working with several implementers of the Iceberg spec and it has previously been agreed that it would be best if the source of truth for Variant existed in an engine and file format neutral location. The Iceberg project has a good open model of governance and, as we have seen so far discussing Variant, open and active collaboration. This would also help as we can strictly version our changes in-line with the rest of the Iceberg spec.
>> > > > >>>>>>>>>>>>>>
>> > > > >>>>>>>>>>>>>> Third, the Shredding spec is not quite finished and requires some group analysis and discussion before we commit it. I think again the Iceberg community is probably the right place for this to happen as we have already started discussions here on these topics.
>> > > > >>>>>>>>>>>>>>
>> > > > >>>>>>>>>>>>>> For these reasons I think we should go with a direct copy of the existing specification from the Spark Project and move ahead with our discussions and modifications within Iceberg. That said, I do not want to diverge if possible from the Spark proposal. For example, although we do not use the Interval types above, I think we should not reuse those type ids within our spec. Iceberg's Variant Spec types 19 and 20 would remain unused along with any other types we think are not applicable.
We should strive whenever possible to allow for compatibility.
>> > > > >>>>>>>>>>>>>>
>> > > > >>>>>>>>>>>>>> In the interest of moving forward with this proposal I am hoping to see if anyone in the community objects to this plan going forward or has a better alternative.
>> > > > >>>>>>>>>>>>>>
>> > > > >>>>>>>>>>>>>> As always I am thankful for your time and am eager to hear back from everyone,
>> > > > >>>>>>>>>>>>>> Russ

--
Xuanwo

https://xuanwo.io/