From this thread https://lists.apache.org/thread/0k5oj3mn0049fcxoxm3gx3d7r28gw4rj, it seems the Spark community is leaning toward moving to Parquet.
Gang, can you help start a discussion in the Parquet community on adopting and maintaining the Variant spec?

On Thu, Aug 22, 2024 at 8:08 AM Curt Hagenlocher <c...@hagenlocher.org> wrote:
> This seems to straddle that line, in that you can also view this as a way to represent semi-structured data in a manner that allows for more efficient querying and computation by breaking out some of its components into a more structured form.
>
> (I also happen to want a canonical Arrow representation for variant data, as this type occurs in many databases but doesn't have a great representation today in ADBC results. That's why I filed [Format] Consider adding an official variant type to Arrow · Issue #42069 · apache/arrow (github.com) <https://github.com/apache/arrow/issues/42069>. Of course, there's no specific reason why a canonical Arrow representation for variants must align with Spark and/or Iceberg.)
>
> -Curt
>
> On Thu, Aug 22, 2024 at 2:01 AM Antoine Pitrou <anto...@python.org> wrote:
>> Ah, thanks. I've tried to find a rationale and ended up on https://lists.apache.org/thread/xnyo1k66dxh0ffpg7j9f04xgos0kwc34 . Is it a good description of what you're after?
>>
>> If so, then I don't think Arrow is a good match. This seems mostly to be a marshalling format for semi-structured data (like Avro?). Arrow data types are meant to be in a representation ideal for querying and computation, rather than transport and storage.
>>
>> This could be developed separately and then be represented in Arrow using an extension type (perhaps a canonical one as in https://arrow.apache.org/docs/dev/format/CanonicalExtensions.html).
>>
>> What do other Arrow developers think?
>>
>> Regards
>>
>> Antoine.
>>
>> On 22/08/2024 at 10:45, Gang Wu wrote:
>> > Sorry for the inconvenience.
>> >
>> > This is the permalink for the discussion:
>> > https://lists.apache.org/thread/hopkr2f0ftoywwt9zo3jxb7n0ob5s5bw
>> >
>> > On Thu, Aug 22, 2024 at 3:51 PM Antoine Pitrou <anto...@python.org> wrote:
>> >> Hi Gang,
>> >>
>> >> Sorry, but can you give a pointer to the start of this discussion thread in a readable format (for example a mailing-list archive)? It appears that dev@arrow wasn't cc'ed from the start and that can make it difficult to understand what this is about.
>> >>
>> >> Regards
>> >>
>> >> Antoine.
>> >>
>> >> On 22/08/2024 at 08:32, Gang Wu wrote:
>> >>> It seems that we have reached a consensus to some extent that there should be a new home for the variant spec. The pending question is whether Parquet or Arrow is a better choice. As a committer in the Arrow, Parquet and ORC communities, I am neutral on which to choose and happy to help with the move once a decision has been made.
>> >>>
>> >>> Should we start a vote to move forward?
>> >>>
>> >>> Best,
>> >>> Gang
>> >>>
>> >>> On Sat, Aug 17, 2024 at 8:34 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
>> >>>>> That being said, I think the most important consideration for now is where are the current maintainers / contributors to the variant type. If most of them are already PMC members / committers on a project, it becomes a bit easier. Otherwise if there isn't much overlap with a project's existing governance, I worry there could be a bit of friction. How many active contributors are there from Iceberg? And how about from Arrow?
>> >>>>
>> >>>> I think this is the key question. What are the requirements around governance? I've seen some tangential messaging here but I'm not clear on what everyone expects.
>> >>>>
>> >>>> I think for a lot of the other concerns my view is that the exact project does not really matter (and choosing a project with mature cross-language testing infrastructure or committing to building it is critical). IIUC we are talking about the following artifacts:
>> >>>>
>> >>>> 1. A stand-alone specification document (this can be hosted anyplace)
>> >>>> 2. A set of language bindings with minimal dependencies that can be consumed downstream (again, as long as dependencies are managed carefully any project can host these)
>> >>>> 3. Potential integration where appropriate into file format libraries to support shredding (but as of now this is being bypassed by using conventions anyways). My impression is that at least for Parquet there has been a proliferation of vectorized readers across different projects, so I'm not clear how much standardization in parquet-java could help here.
>> >>>>
>> >>>> To respond to some other questions:
>> >>>>
>> >>>>> Arrow is not used as Spark's in-memory model, nor Trino and others so those existing relationships aren't there. I also worry that differences in approaches would make it difficult later on.
>> >>>>
>> >>>> While Arrow is not in the core memory model, for Spark I believe it is still used for IPC for things like Java<->Python. Trino also consumes Arrow libraries today to support things like Snowflake/Bigquery federation. But I think this is minor because as mentioned above I think the functional libraries would be relatively stand-alone.
>> >>>>
>> >>>>> Do we think it could be introduced as a canonical extension arrow type?
>> >>>>
>> >>>> I believe it can be, I think there are probably different layouts that can be supported:
>> >>>>
>> >>>> 1. A struct with two variable width bytes columns (metadata and value data are stored separately and each entry has a 1:1 relationship).
>> >>>> 2. Shredded (shredded according to the same convention as Parquet). I would need to double check but I don't think Arrow would have problems here, but REE would likely be required to make this efficient (i.e. sparse value support is important).
>> >>>>
>> >>>> In both cases the main complexity is providing the necessary functions for manipulation.
>> >>>>
>> >>>> Thanks,
>> >>>> Micah
>> >>>>
>> >>>> On Fri, Aug 16, 2024 at 3:58 PM Will Jones <will.jones...@gmail.com> wrote:
>> >>>>> In being more engine and format agnostic, I agree the Arrow project might be a good host for such a specification. It seems like we want to move away from hosting in Spark to make it engine agnostic. But moving into Iceberg might make it less format agnostic, as I understand multiple formats might want to implement this. I'm not intimately familiar with the state of this, but I believe Delta Lake would like to be aligned with the same format as Iceberg. In addition, the Lance format (which I work on), will eventually be interesting as well. It seems equally bad to me to attach this specification to a particular table format as it does a particular query engine.
>> >>>>>
>> >>>>> That being said, I think the most important consideration for now is where are the current maintainers / contributors to the variant type. If most of them are already PMC members / committers on a project, it becomes a bit easier. Otherwise if there isn't much overlap with a project's existing governance, I worry there could be a bit of friction. How many active contributors are there from Iceberg? And how about from Arrow?
>> >>>>>
>> >>>>> BTW, I'd add I'm interested in helping develop an Arrow extension type for the binary variant type. I've been experimenting with a DataFusion extension that operates on this [1], and already have some ideas on how such an extension type might be defined. I'm not yet caught up on the shredded specification, but I think having just the binary format would be beneficial for in-memory analytics, which are most relevant to Arrow. I'll be creating a separate thread on the Arrow ML about this soon.
>> >>>>>
>> >>>>> Best,
>> >>>>>
>> >>>>> Will Jones
>> >>>>>
>> >>>>> [1] https://github.com/datafusion-contrib/datafusion-functions-variant/issues
>> >>>>>
>> >>>>> On Thu, Aug 15, 2024 at 7:39 PM Gang Wu <ust...@gmail.com> wrote:
>> >>>>>> + dev@arrow
>> >>>>>>
>> >>>>>> Thanks for all the valuable suggestions! I am inclined to Micah's idea that Arrow might be a better host compared to Parquet.
>> >>>>>>
>> >>>>>> To give more context, I am taking the initiative to add the geometry type to both Parquet and ORC. I'd like to do the same thing for variant type in that variant type is engine and file format agnostic. This does mean that Parquet might not be the neutral place to hold the variant spec.
>> >>>>>>
>> >>>>>> Best,
>> >>>>>> Gang
>> >>>>>>
>> >>>>>> On Fri, Aug 16, 2024 at 10:00 AM Jingsong Li <jingsongl...@gmail.com> wrote:
>> >>>>>>> Thanks all for your discussion.
>> >>>>>>>
>> >>>>>>> The Apache Paimon community is also considering support for this Variant type. Without a doubt, we hope to maintain consistency with Iceberg.
>> >>>>>>>
>> >>>>>>> Not only the Paimon community, but also various computing engines need to adapt to this type, such as Flink and StarRocks. We also hope to promote them to adapt to this type.
>> >>>>>>>
>> >>>>>>> It is worth noting that we also need to standardize many functions related to it.
>> >>>>>>>
>> >>>>>>> A neutral place to maintain it is a great choice.
>> >>>>>>>
>> >>>>>>> - As Gang Wu said, a standalone project is good, just like RoaringBitmap [1].
>> >>>>>>> - As Ryan said, the Parquet community is a neutral option too.
>> >>>>>>> - As Micah said, Arrow is also an option.
>> >>>>>>>
>> >>>>>>> [1] https://github.com/RoaringBitmap
>> >>>>>>>
>> >>>>>>> Best,
>> >>>>>>> Jingsong
>> >>>>>>>
>> >>>>>>> On Fri, Aug 16, 2024 at 7:18 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
>> >>>>>>>>> That's fair @Micah, so far all the discussions have been direct and off the dev list. Would you like to make the request on the public Spark Dev list? I would be glad to co-sign, I can also draft up a quick email if you don't have time.
>> >>>>>>>>
>> >>>>>>>> I think once we come to consensus, if you have bandwidth, I think the message might be better coming from you, as you have more context on some of the non-public conversations, the requirements from an Iceberg perspective on governance and the blockers that were encountered. If details on the conversations can't be shared (i.e. we are starting from scratch), it seems like suggesting a new project via SPIP might be the way forward. I'm happy to help with that if it is useful but I would guess Aihua or Tyler might be in a better place to start as it seems they have done more serious thinking here.
>> >>>>>>>>
>> >>>>>>>> If we decide to try to standardize on Parquet or Arrow I'm happy to help support the effort in those communities.
>> >>>>>>>>
>> >>>>>>>> Thanks,
>> >>>>>>>> Micah
>> >>>>>>>>
>> >>>>>>>> On Thu, Aug 15, 2024 at 8:09 AM Russell Spitzer <russell.spit...@gmail.com> wrote:
>> >>>>>>>>> That's fair @Micah, so far all the discussions have been direct and off the dev list. Would you like to make the request on the public Spark Dev list? I would be glad to co-sign, I can also draft up a quick email if you don't have time.
>> >>>>>>>>>
>> >>>>>>>>> On Thu, Aug 15, 2024 at 10:04 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
>> >>>>>>>>>>> I agree that it would be beneficial to make a sub-project, the main problem is political and not logistic. I've been asking for movement from other relevant projects for a month and we simply haven't gotten anywhere.
>> >>>>>>>>>>
>> >>>>>>>>>> I just wanted to double check that these issues were brought directly to the Spark community (i.e. a discussion thread on the Spark developer mailing list) and not via backchannels.
>> >>>>>>>>>>
>> >>>>>>>>>> I'm not sure the outcome would be different and I don't think this should block forking the spec, but we should make sure that the decision is publicly documented within both communities.
>> >>>>>>>>>>
>> >>>>>>>>>> Thanks,
>> >>>>>>>>>> Micah
>> >>>>>>>>>>
>> >>>>>>>>>> On Thu, Aug 15, 2024 at 7:47 AM Russell Spitzer <russell.spit...@gmail.com> wrote:
>> >>>>>>>>>>> @Gang Wu
>> >>>>>>>>>>>
>> >>>>>>>>>>> I agree that it would be beneficial to make a sub-project, the main problem is political and not logistic. I've been asking for movement from other relevant projects for a month and we simply haven't gotten anywhere. I don't think there is anything that would stop us from moving to a joint project in the future and if you know of some way of encouraging that movement from other relevant parties I would be glad to collaborate in doing that. One thing that I don't want to do is have the Iceberg project stay in a holding pattern without any clear roadmap as to how to proceed.
>> >>>>>>>>>>>
>> >>>>>>>>>>> On Wed, Aug 14, 2024 at 11:12 PM Yufei Gu <flyrain...@gmail.com> wrote:
>> >>>>>>>>>>>> I’m on board with copying the spec into our repository. However, as we’ve talked about, it’s not just a straightforward copy; there are already some divergences. Some of them are under discussion. Iceberg is definitely the best place for these specs. Engines like Trino and Flink can then rely on the Iceberg specs as a solid foundation.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> Yufei
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> On Wed, Aug 14, 2024 at 7:51 PM Gang Wu <ust...@gmail.com> wrote:
>> >>>>>>>>>>>>> Sorry for chiming in late.
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> From the discussion in https://lists.apache.org/thread/xcyytoypgplfr74klg1z2rgjo6k5b0sq, I don't quite understand why it is logistically complicated to create a sub-project to hold the variant spec and impl.
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> IMHO, copying the variant type spec into Apache Iceberg has some deficiencies:
>> >>>>>>>>>>>>> - It is a burden to update two repos if there is a variant type spec change and will likely result in deviation if some changes do not reach agreement from both parties.
>> >>>>>>>>>>>>> - Implementers are required to keep an eye on both specs (considering proprietary engines where both Iceberg and Delta are supported).
>> >>>>>>>>>>>>> - Putting the spec and impl of variant type in the Iceberg repo does lose the opportunity for better native support from file formats like Parquet and ORC.
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> I'm not sure if it is possible to create a separate project (e.g. apache/variant-type) to make it a single point of truth. We can learn from the experience of Apache Arrow. In this fashion, different engines, table formats and file formats can follow the same spec and are free to depend on the reference implementations from apache/variant-type or implement their own.
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> Best,
>> >>>>>>>>>>>>> Gang
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> On Thu, Aug 15, 2024 at 10:07 AM Jack Ye <yezhao...@gmail.com> wrote:
>> >>>>>>>>>>>>>> +1 for copying the spec into our repository, I think we need to own it fully as a part of the table spec, and we can build compatibility through tests.
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> -Jack
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> On Wed, Aug 14, 2024 at 12:52 PM Russell Spitzer <russell.spit...@gmail.com> wrote:
>> >>>>>>>>>>>>>>> I'm not really in favor of linking and annotating as that just makes things more complicated and still is essentially forking just with more steps. If we just track our annotations / modifications to a single commit/version then we have the same issue again but now you have to go to multiple sources to get the actual Spec. In addition, our very copy of the Spec is going to require new types which don't exist in the Spark Spec which necessarily means diverging. We will need to take up new primitive id's (as noted in my first email)
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> The other issue I have is I don't think the Spark Spec is really going through a thorough review process from all members of the Spark community, I believe it probably should have gone through the SPIP but instead seems to have been merged without broad community involvement.
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> The only way to truly avoid diverging is to only have a single copy of the spec, in our previous discussions the vast majority of Apache Iceberg community want it to exist here.
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> On Wed, Aug 14, 2024 at 2:19 PM Daniel Weeks <dwe...@apache.org> wrote:
>> >>>>>>>>>>>>>>>> I'm really excited about the introduction of variant type to Iceberg, but I want to raise concerns about forking the spec.
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> I feel like preemptively forking would create the situation where we end up diverging because there's little reason to work with both communities to evolve in a way that benefits everyone.
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> I would much rather point to a specific version of the spec and annotate any variance in Iceberg's handling. This would allow us to continue without dividing the communities.
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> If at any point there are irreconcilable differences, I would support forking, but I don't feel like that should be the initial step.
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> No one is excited about the possibility that the physical representations end up diverging, but it feels like we're setting ourselves up for that exact scenario.
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> -Dan
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> On Wed, Aug 14, 2024 at 6:54 AM Fokko Driesprong <fo...@apache.org> wrote:
>> >>>>>>>>>>>>>>>>> +1 to what's already being said here. It is good to copy the spec to Iceberg and add context that's specific to Iceberg, but at the same time, we should maintain compatibility.
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> Kind regards,
>> >>>>>>>>>>>>>>>>> Fokko
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> On Wed, Aug 14, 2024 at 15:30, Manu Zhang <owenzhang1...@gmail.com> wrote:
>> >>>>>>>>>>>>>>>>>> +1 to copy the spec into our repository. I think the best way to keep compatibility is building integration tests.
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>> Thanks,
>> >>>>>>>>>>>>>>>>>> Manu
>> >>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>> On Wed, Aug 14, 2024 at 8:27 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>> >>>>>>>>>>>>>>>>>>> Thanks Russell and Aihua for pushing Variant support!
>> >>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>> Given the differences between the supported types and the lack of interest from the other project, I think it is reasonable to duplicate the specification to our repository.
>> >>>>>>>>>>>>>>>>>>> I would give very strong emphasis on sticking to the Spark spec as much as possible, to keep compatibility as much as possible. Maybe even revert to a shared specification if the situation changes.
>> >>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>> Thanks,
>> >>>>>>>>>>>>>>>>>>> Peter
>> >>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>> On Tue, Aug 13, 2024 at 19:52, Aihua Xu <aihu...@gmail.com> wrote:
>> >>>>>>>>>>>>>>>>>>>> Thanks Russell for bringing this up.
>> >>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>> This is the main blocker to moving forward with the Variant support in Iceberg and hopefully we can have a consensus. To me, it also makes more sense to move the spec into Iceberg rather than have the Spark engine own it while we try to keep compatible with the Spark spec.
>> >>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>> Thanks,
>> >>>>>>>>>>>>>>>>>>>> Aihua
>> >>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>> On Mon, Aug 12, 2024 at 6:50 PM Russell Spitzer <russell.spit...@gmail.com> wrote:
>> >>>>>>>>>>>>>>>>>>>>> Hi Y’all,
>> >>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>> We’ve hit a bit of a roadblock with the Variant Proposal, while we were hoping to move the Variant and Shredding specifications from Spark into Iceberg there doesn’t seem to be a lot of interest in that. Unfortunately, I think we have a number of issues with just linking to the Spark project directly from within Iceberg and I believe we need to copy the specifications into our repository.
>> >>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>> There are a few reasons why I think this is necessary:
>> >>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>> First, we have a divergence of types already. The Spark Specification already includes types which Iceberg has no definition for (19, 20 - Interval Types) and Iceberg already has a type which is not included within the Spark Specification (Time) and will soon have more with TimestampNS, and Geo.
>> >>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>> Second, we would like to make sure that Spark is not a hard dependency for other engines. We are working with several implementers of the Iceberg spec and it has previously been agreed that it would be best if the source of truth for Variant existed in an engine and file format neutral location. The Iceberg project has a good open model of governance and, as we have seen so far discussing Variant, open and active collaboration. This would also help as we can strictly version our changes in-line with the rest of the Iceberg spec.
>> >>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>> Third, the Shredding spec is not quite finished and requires some group analysis and discussion before we commit it. I think again the Iceberg community is probably the right place for this to happen as we have already started discussions here on these topics.
>> >>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>> For these reasons I think we should go with a direct copy of the existing specification from the Spark Project and move ahead with our discussions and modifications within Iceberg. That said, I do not want to diverge if possible from the Spark proposal. For example, although we do not use the Interval types above, I think we should not reuse those type ids within our spec. Iceberg's Variant Spec types 19 and 20 would remain unused along with any other types we think are not applicable. We should strive whenever possible to allow for compatibility.
>> >>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>> In the interest of moving forward with this proposal I am hoping to see if anyone in the community objects to this plan going forward or has a better alternative.
>> >>>>>>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>>>>>> As always I am thankful for your time and am eager to hear back from everyone,
>> >>>>>>>>>>>>>>>>>>>>> Russ