Thanks, Gang, for initiating the discussion.

On Fri, Aug 23, 2024 at 2:22 AM Gang Wu <ust...@gmail.com> wrote:
> Thanks Aihua!
>
> I've started the discussion in dev@parquet:
> https://lists.apache.org/thread/6h58hj39lhqtcyd2hlsyvqm4lzdh4b9z
>
> Best,
> Gang
>
> On Fri, Aug 23, 2024 at 12:53 PM Aihua Xu <aihua...@snowflake.com> wrote:
>
>> From this thread
>> https://lists.apache.org/thread/0k5oj3mn0049fcxoxm3gx3d7r28gw4rj, seems
>> Spark community is leaning toward moving to Parquet.
>>
>> Gang, can you help start a discussion in the parquet community on
>> adopting and maintaining such Variant spec?
>>
>> On Thu, Aug 22, 2024 at 8:08 AM Curt Hagenlocher <c...@hagenlocher.org>
>> wrote:
>>
>>> This seems to straddle that line, in that you can also view this as a
>>> way to represent semi-structured data in a manner that allows for more
>>> efficient querying and computation by breaking out some of its components
>>> into a more structured form.
>>>
>>> (I also happen to want a canonical Arrow representation for variant
>>> data, as this type occurs in many databases but doesn't have a great
>>> representation today in ADBC results. That's why I filed [Format]
>>> Consider adding an official variant type to Arrow · Issue #42069 ·
>>> apache/arrow (github.com) <https://github.com/apache/arrow/issues/42069>.
>>> Of course, there's no specific reason why a canonical Arrow
>>> representation for variants must align with Spark and/or Iceberg.)
>>>
>>> -Curt
>>>
>>> On Thu, Aug 22, 2024 at 2:01 AM Antoine Pitrou <anto...@python.org>
>>> wrote:
>>>
>>>>
>>>> Ah, thanks. I've tried to find a rationale and ended up on
>>>> https://lists.apache.org/thread/xnyo1k66dxh0ffpg7j9f04xgos0kwc34 . Is it
>>>> a good description of what you're after?
>>>>
>>>> If so, then I don't think Arrow is a good match. This seems mostly to be
>>>> a marshalling format for semi-structured data (like Avro?). Arrow data
>>>> types are meant to be in a representation ideal for querying and
>>>> computation, rather than transport and storage.
>>>> >>>> This could be developed separately and then be represented in Arrow >>>> using an extension type (perhaps a canonical one as in >>>> https://arrow.apache.org/docs/dev/format/CanonicalExtensions.html). >>>> >>>> What do other Arrow developers think? >>>> >>>> Regards >>>> >>>> Antoine. >>>> >>>> >>>> Le 22/08/2024 à 10:45, Gang Wu a écrit : >>>> > Sorry for the inconvenience. >>>> > >>>> > This is the permalink for the discussion: >>>> > https://lists.apache.org/thread/hopkr2f0ftoywwt9zo3jxb7n0ob5s5bw >>>> > >>>> > On Thu, Aug 22, 2024 at 3:51 PM Antoine Pitrou <anto...@python.org> >>>> wrote: >>>> > >>>> >> >>>> >> Hi Gang, >>>> >> >>>> >> Sorry, but can you give a pointer to the start of this discussion >>>> thread >>>> >> in a readable format (for example a mailing-list archive)? It appears >>>> >> that dev@arrow wasn't cc'ed from the start and that can make it >>>> >> difficult to understand what this is about. >>>> >> >>>> >> Regards >>>> >> >>>> >> Antoine. >>>> >> >>>> >> >>>> >> Le 22/08/2024 à 08:32, Gang Wu a écrit : >>>> >>> It seems that we have reached a consensus to some extent that there >>>> >>> should be a new home for the variant spec. The pending question >>>> >>> is whether Parquet or Arrow is a better choice. As a committer from >>>> >> Arrow, >>>> >>> Parquet and ORC communities, I am neutral to choose any and happy to >>>> >>> help with the movement once a decision has been made. >>>> >>> >>>> >>> Should we start a vote to move forward? >>>> >>> >>>> >>> Best, >>>> >>> Gang >>>> >>> >>>> >>> On Sat, Aug 17, 2024 at 8:34 AM Micah Kornfield < >>>> emkornfi...@gmail.com> >>>> >>> wrote: >>>> >>> >>>> >>>>> >>>> >>>>> That being said, I think the most important consideration for now >>>> is >>>> >>>> where >>>> >>>>> are the current maintainers / contributors to the variant type. >>>> If most >>>> >>>> of >>>> >>>>> them are already PMC members / committers on a project, it >>>> becomes a >>>> >> bit >>>> >>>>> easier. 
Otherwise if there isn't much overlap with a project's >>>> existing >>>> >>>>> governance, I worry there could be a bit of friction. How many >>>> active >>>> >>>>> contributors are there from Iceberg? And how about from Arrow? >>>> >>>> >>>> >>>> >>>> >>>> I think this is the key question. What are the requirements around >>>> >>>> governance? I've seen some tangential messaging here but I'm not >>>> clear >>>> >> on >>>> >>>> what everyone expects. >>>> >>>> >>>> >>>> I think for a lot of the other concerns my view is that the exact >>>> >> project >>>> >>>> does not really matter (and choosing a project with mature cross >>>> >> language >>>> >>>> testing infrastructure or committing to building it is critical). >>>> IIUC >>>> >> we >>>> >>>> are talking about following artifacts: >>>> >>>> >>>> >>>> 1. A stand alone specification document (this can be hosted >>>> anyplace) >>>> >>>> 2. A set of language bindings with minimal dependencies can be >>>> consumed >>>> >>>> downstream (again, as long as dependencies are managed carefully >>>> any >>>> >>>> project can host these) >>>> >>>> 3. Potential integration where appropriate into file format >>>> libraries >>>> >> to >>>> >>>> support shredding (but as of now this is being bypassed by using >>>> >>>> conventions anyways). My impression is that at least for Parquet >>>> there >>>> >> has >>>> >>>> been a proliferation of vectorized readers across different >>>> projects, so >>>> >>>> I'm not clear how much standardization in parquet-java could help >>>> here. >>>> >>>> >>>> >>>> To respond to some other questions: >>>> >>>> >>>> >>>> Arrow is not used as Spark's in-memory model, nor Trino and others >>>> so >>>> >> those >>>> >>>>> existing relationships aren't there. I also worry that >>>> differences in >>>> >>>>> approaches would make it difficult later on. 
>>>> >>>> >>>> >>>> >>>> >>>> While Arrow is not in the core memory model, for Spark I believe >>>> it is >>>> >>>> still used for IPC for things like Java<->Python. Trino also >>>> consumes >>>> >> Arrow >>>> >>>> libraries today to support things like Snowflake/Bigquery >>>> federation. >>>> >> But I >>>> >>>> think this is minor because as mentioned above I think the >>>> functional >>>> >>>> libraries would be relatively stand-alone. >>>> >>>> >>>> >>>> Do we think it could be introduced as a canonical extension arrow >>>> type? >>>> >>>> >>>> >>>> >>>> >>>> I believe it can be, I think there are probably different >>>> layouts >>>> >> that can >>>> >>>> be supported: >>>> >>>> >>>> >>>> 1. A struct with two variable width bytes columns (metadata and >>>> value >>>> >> data >>>> >>>> are stored separately and each entry has a 1:1 relationship). >>>> >>>> 2. Shredded (shredded according to the same convention as >>>> parquet), I >>>> >>>> would need to double check but I don't think Arrow would have >>>> problems >>>> >> here >>>> >>>> but REE would likely be required to make this efficient (i.e. >>>> sparse >>>> >> value >>>> >>>> support is important). >>>> >>>> >>>> >>>> In both cases the main complexity is providing the necessary >>>> functions >>>> >> for >>>> >>>> manipulation. >>>> >>>> >>>> >>>> Thanks, >>>> >>>> Micah >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> On Fri, Aug 16, 2024 at 3:58 PM Will Jones < >>>> will.jones...@gmail.com> >>>> >>>> wrote: >>>> >>>> >>>> >>>>> In being more engine and format agnostic, I agree the Arrow >>>> project >>>> >> might >>>> >>>>> be a good host for such a specification. It seems like we want to >>>> move >>>> >>>> away >>>> >>>>> from hosting in Spark to make it engine agnostic. But moving into >>>> >> Iceberg >>>> >>>>> might make it less format agnostic, as I understand multiple >>>> formats >>>> >>>> might >>>> >>>>> want to implement this. 
I'm not intimately familiar with the >>>> state of >>>> >>>> this, >>>> >>>>> but I believe Delta Lake would like to be aligned with the same >>>> format >>>> >> as >>>> >>>>> Iceberg. In addition, the Lance format (which I work on), will >>>> >> eventually >>>> >>>>> be interesting as well. It seems equally bad to me to attach this >>>> >>>>> specification to a particular table format as it does a particular >>>> >> query >>>> >>>>> engine. >>>> >>>>> >>>> >>>>> That being said, I think the most important consideration for now >>>> is >>>> >>>> where >>>> >>>>> are the current maintainers / contributors to the variant type. >>>> If most >>>> >>>> of >>>> >>>>> them are already PMC members / committers on a project, it >>>> becomes a >>>> >> bit >>>> >>>>> easier. Otherwise if there isn't much overlap with a project's >>>> existing >>>> >>>>> governance, I worry there could be a bit of friction. How many >>>> active >>>> >>>>> contributors are there from Iceberg? And how about from Arrow? >>>> >>>>> >>>> >>>>> BTW, I'd add I'm interested in helping develop an Arrow extension >>>> type >>>> >>>> for >>>> >>>>> the binary variant type. I've been experimenting with a DataFusion >>>> >>>>> extension that operates on this [1], and already have some ideas >>>> on how >>>> >>>>> such an extension type might be defined. I'm not yet caught up on >>>> the >>>> >>>>> shredded specification, but I think having just the binary format >>>> would >>>> >>>> be >>>> >>>>> beneficial for in-memory analytics, which are most relevant to >>>> Arrow. >>>> >>>> I'll >>>> >>>>> be creating a seperate thread on the Arrow ML about this soon. 
>>>> >>>>> >>>> >>>>> Best, >>>> >>>>> >>>> >>>>> Will Jones >>>> >>>>> >>>> >>>>> [1] >>>> >>>>> >>>> >>>> >>>> >> >>>> https://github.com/datafusion-contrib/datafusion-functions-variant/issues >>>> >>>>> >>>> >>>>> >>>> >>>>> On Thu, Aug 15, 2024 at 7:39 PM Gang Wu <ust...@gmail.com> wrote: >>>> >>>>> >>>> >>>>>> + dev@arrow >>>> >>>>>> >>>> >>>>>> Thanks for all the valuable suggestions! I am inclined to >>>> Micah's idea >>>> >>>>> that >>>> >>>>>> Arrow might be a better host compared to Parquet. >>>> >>>>>> >>>> >>>>>> To give more context, I am taking the initiative to add the >>>> geometry >>>> >>>> type >>>> >>>>>> to both Parquet and ORC. I'd like to do the same thing for >>>> variant >>>> >> type >>>> >>>>> in >>>> >>>>>> that variant type is engine and file format agnostic. This does >>>> mean >>>> >>>> that >>>> >>>>>> Parquet might not be the neutral place to hold the variant spec. >>>> >>>>>> >>>> >>>>>> Best, >>>> >>>>>> Gang >>>> >>>>>> >>>> >>>>>> On Fri, Aug 16, 2024 at 10:00 AM Jingsong Li < >>>> jingsongl...@gmail.com> >>>> >>>>>> wrote: >>>> >>>>>> >>>> >>>>>>> Thanks all for your discussion. >>>> >>>>>>> >>>> >>>>>>> The Apache Paimon community is also considering support for this >>>> >>>>>>> Variant type, without a doubt, we hope to maintain consistency >>>> with >>>> >>>>>>> Iceberg. >>>> >>>>>>> >>>> >>>>>>> Not only the Paimon community, but also various computing >>>> engines >>>> >>>> need >>>> >>>>>>> to adapt to this type, such as Flink and StarRocks. We also >>>> hope to >>>> >>>>>>> promote them to adapt to this type. >>>> >>>>>>> >>>> >>>>>>> It is worth noting that we also need to standardize many >>>> functions >>>> >>>>>>> related to it. >>>> >>>>>>> >>>> >>>>>>> A neutral place to maintain it is a great choice. >>>> >>>>>>> >>>> >>>>>>> - As Gang Wu said, a standalone project is good, just like >>>> >>>>> RoaringBitmap >>>> >>>>>>> [1]. >>>> >>>>>>> - As Ryan said, Parquet community is a neutral option too. 
>>>> >>>>>>> - As Micah said, Arrow is also an option too. >>>> >>>>>>> >>>> >>>>>>> [1] https://github.com/RoaringBitmap >>>> >>>>>>> >>>> >>>>>>> Best, >>>> >>>>>>> Jingsong >>>> >>>>>>> >>>> >>>>>>> On Fri, Aug 16, 2024 at 7:18 AM Micah Kornfield < >>>> >>>> emkornfi...@gmail.com >>>> >>>>>> >>>> >>>>>>> wrote: >>>> >>>>>>>>> >>>> >>>>>>>>> Thats fair @Micah, so far all the discussions have been >>>> direct and >>>> >>>>> off >>>> >>>>>>> the dev list. Would you like to make the request on the public >>>> Spark >>>> >>>>> Dev >>>> >>>>>>> list? I would be glad to co-sign, I can also draft up a quick >>>> email >>>> >>>> if >>>> >>>>>> you >>>> >>>>>>> don't have time. >>>> >>>>>>>> >>>> >>>>>>>> >>>> >>>>>>>> I think once we come to consensus, if you have bandwidth, I >>>> think >>>> >>>> the >>>> >>>>>>> message might be better coming from you, as you have more >>>> context on >>>> >>>>> some >>>> >>>>>>> of the non-public conversations, the requirements from an >>>> Iceberg >>>> >>>>>>> perspective on governance and the blockers that were >>>> encountered. If >>>> >>>>>>> details on the conversations can't be shared, (i.e. we are >>>> starting >>>> >>>>> from >>>> >>>>>>> scratch) it seems like suggesting a new project via SPIP might >>>> be the >>>> >>>>> way >>>> >>>>>>> forward. I'm happy to help with that if it is useful but I >>>> would >>>> >>>> guess >>>> >>>>>>> Aihua or Tyler might be in a better place to start as it seems >>>> they >>>> >>>>> have >>>> >>>>>>> done more serious thinking here. >>>> >>>>>>>> >>>> >>>>>>>> If we decide to try to standardize on Parquet or Arrow I'm >>>> happy to >>>> >>>>>> help >>>> >>>>>>> support the effort in those communities. 
>>>> >>>>>>>> >>>> >>>>>>>> Thanks, >>>> >>>>>>>> Micah >>>> >>>>>>>> >>>> >>>>>>>> On Thu, Aug 15, 2024 at 8:09 AM Russell Spitzer < >>>> >>>>>>> russell.spit...@gmail.com> wrote: >>>> >>>>>>>>> >>>> >>>>>>>>> Thats fair @Micah, so far all the discussions have been >>>> direct and >>>> >>>>> off >>>> >>>>>>> the dev list. Would you like to make the request on the public >>>> Spark >>>> >>>>> Dev >>>> >>>>>>> list? I would be glad to co-sign, I can also draft up a quick >>>> email >>>> >>>> if >>>> >>>>>> you >>>> >>>>>>> don't have time. >>>> >>>>>>>>> >>>> >>>>>>>>> On Thu, Aug 15, 2024 at 10:04 AM Micah Kornfield < >>>> >>>>>> emkornfi...@gmail.com> >>>> >>>>>>> wrote: >>>> >>>>>>>>>>> >>>> >>>>>>>>>>> I agree that it would be beneficial to make a sub-project, >>>> the >>>> >>>>> main >>>> >>>>>>> problem is political and not logistic. I've been asking for >>>> movement >>>> >>>>> from >>>> >>>>>>> other relative projects for a month and we simply haven't gotten >>>> >>>>>> anywhere. >>>> >>>>>>>>>> >>>> >>>>>>>>>> >>>> >>>>>>>>>> I just wanted to double check that these issues were brought >>>> >>>>> directly >>>> >>>>>>> to the spark community (i.e. a discussion thread on the Spark >>>> >>>> developer >>>> >>>>>>> mailing list) and not via backchannels. >>>> >>>>>>>>>> >>>> >>>>>>>>>> I'm not sure the outcome would be different and I don't think >>>> >>>> this >>>> >>>>>>> should block forking the spec, but we should make sure that the >>>> >>>>> decision >>>> >>>>>> is >>>> >>>>>>> publicly documented within both communities. >>>> >>>>>>>>>> >>>> >>>>>>>>>> Thanks, >>>> >>>>>>>>>> Micah >>>> >>>>>>>>>> >>>> >>>>>>>>>> On Thu, Aug 15, 2024 at 7:47 AM Russell Spitzer < >>>> >>>>>>> russell.spit...@gmail.com> wrote: >>>> >>>>>>>>>>> >>>> >>>>>>>>>>> @Gang Wu >>>> >>>>>>>>>>> >>>> >>>>>>>>>>> I agree that it would be beneficial to make a sub-project, >>>> the >>>> >>>>> main >>>> >>>>>>> problem is political and not logistic. 
I've been asking for >>>> movement >>>> >>>>> from >>>> >>>>>>> other relative projects for a month and we simply haven't gotten >>>> >>>>>> anywhere. >>>> >>>>>>> I don't think there is anything that would stop us from moving >>>> to a >>>> >>>>> joint >>>> >>>>>>> project in the future and if you know of some way of >>>> encouraging that >>>> >>>>>>> movement from other relevant parties I would be glad to >>>> collaborate >>>> >>>> in >>>> >>>>>>> doing that. One thing that I don't want to do is have the >>>> Iceberg >>>> >>>>> project >>>> >>>>>>> stay in a holding pattern without any clear roadmap as to how to >>>> >>>>> proceed. >>>> >>>>>>>>>>> >>>> >>>>>>>>>>> On Wed, Aug 14, 2024 at 11:12 PM Yufei Gu < >>>> flyrain...@gmail.com >>>> >>>>> >>>> >>>>>>> wrote: >>>> >>>>>>>>>>>> >>>> >>>>>>>>>>>> I’m on board with copying the spec into our repository. >>>> >>>> However, >>>> >>>>> as >>>> >>>>>>> we’ve talked about, it’s not just a straightforward copy—there >>>> are >>>> >>>>>> already >>>> >>>>>>> some divergences. Some of them are under discussion. Iceberg is >>>> >>>>>> definitely >>>> >>>>>>> the best place for these specs. Engines like Trino and Flink >>>> can then >>>> >>>>>> rely >>>> >>>>>>> on the Iceberg specs as a solid foundation. >>>> >>>>>>>>>>>> >>>> >>>>>>>>>>>> Yufei >>>> >>>>>>>>>>>> >>>> >>>>>>>>>>>> On Wed, Aug 14, 2024 at 7:51 PM Gang Wu <ust...@gmail.com> >>>> >>>>> wrote: >>>> >>>>>>>>>>>>> >>>> >>>>>>>>>>>>> Sorry for chiming in late. >>>> >>>>>>>>>>>>> >>>> >>>>>>>>>>>>> From the discussion in >>>> >>>>>>> >>>> https://lists.apache.org/thread/xcyytoypgplfr74klg1z2rgjo6k5b0sq, I >>>> >>>>>> don't >>>> >>>>>>> quite understand why it is logistically complicated to create a >>>> >>>>>> sub-project >>>> >>>>>>> to hold the variant spec and impl. 
>>>> >>>>>>>>>>>>> >>>> >>>>>>>>>>>>> IMHO, coping the variant type spec into Apache Iceberg has >>>> >>>> some >>>> >>>>>>> deficiencies: >>>> >>>>>>>>>>>>> - It is a burden to update two repos if there is a variant >>>> >>>> type >>>> >>>>>>> spec change and will likely result in deviation if some changes >>>> do >>>> >>>> not >>>> >>>>>>> reach agreement from both parties. >>>> >>>>>>>>>>>>> - Implementers are required to keep an eye on both specs >>>> >>>>>>> (considering proprietary engines where both Iceberg and Delta >>>> are >>>> >>>>>>> supported). >>>> >>>>>>>>>>>>> - Putting the spec and impl of variant type in Iceberg >>>> repo >>>> >>>> does >>>> >>>>>>> lose the opportunity for better native support from file >>>> formats like >>>> >>>>>>> Parquet and ORC. >>>> >>>>>>>>>>>>> >>>> >>>>>>>>>>>>> I'm not sure if it is possible to create a separate >>>> project >>>> >>>>> (e.g. >>>> >>>>>>> apache/variant-type) to make it a single point of truth. We can >>>> learn >>>> >>>>>> from >>>> >>>>>>> the experience of Apache Arrow. In this fashion, different >>>> engines, >>>> >>>>> table >>>> >>>>>>> formats and file formats can follow the same spec and are free >>>> to >>>> >>>>> depend >>>> >>>>>> on >>>> >>>>>>> the reference implementations from apache/variant-type or >>>> implement >>>> >>>>> their >>>> >>>>>>> own. >>>> >>>>>>>>>>>>> >>>> >>>>>>>>>>>>> Best, >>>> >>>>>>>>>>>>> Gang >>>> >>>>>>>>>>>>> >>>> >>>>>>>>>>>>> >>>> >>>>>>>>>>>>> >>>> >>>>>>>>>>>>> >>>> >>>>>>>>>>>>> On Thu, Aug 15, 2024 at 10:07 AM Jack Ye < >>>> yezhao...@gmail.com >>>> >>>>> >>>> >>>>>>> wrote: >>>> >>>>>>>>>>>>>> >>>> >>>>>>>>>>>>>> +1 for copying the spec into our repository, I think we >>>> need >>>> >>>> to >>>> >>>>>>> own it fully as a part of the table spec, and we can build >>>> >>>>> compatibility >>>> >>>>>>> through tests. 
>>>> >>>>>>>>>>>>>> >>>> >>>>>>>>>>>>>> -Jack >>>> >>>>>>>>>>>>>> >>>> >>>>>>>>>>>>>> On Wed, Aug 14, 2024 at 12:52 PM Russell Spitzer < >>>> >>>>>>> russell.spit...@gmail.com> wrote: >>>> >>>>>>>>>>>>>>> >>>> >>>>>>>>>>>>>>> I'm not really in favor of linking and annotating as >>>> that >>>> >>>> just >>>> >>>>>>> makes things more complicated and still is essentially forking >>>> just >>>> >>>>> with >>>> >>>>>>> more steps. If we just track our annotations / modifications >>>> to a >>>> >>>>> single >>>> >>>>>>> commit/version then we have the same issue again but now you >>>> have to >>>> >>>> go >>>> >>>>>> to >>>> >>>>>>> multiple sources to get the actual Spec. In addition, our very >>>> copy >>>> >>>> of >>>> >>>>>> the >>>> >>>>>>> Spec is going to require new types which don't exist in the >>>> Spark >>>> >>>> Spec >>>> >>>>>>> which necessarily means diverging. We will need to take up new >>>> >>>>> primitive >>>> >>>>>>> id's (as noted in my first email) >>>> >>>>>>>>>>>>>>> >>>> >>>>>>>>>>>>>>> The other issue I have is I don't think the Spark Spec >>>> is >>>> >>>>> really >>>> >>>>>>> going through a thorough review process from all members of the >>>> Spark >>>> >>>>>>> community, I believe it probably should have gone through the >>>> SPIP >>>> >>>> but >>>> >>>>>>> instead seems to have been merged without broad community >>>> >>>> involvement. >>>> >>>>>>>>>>>>>>> >>>> >>>>>>>>>>>>>>> The only way to truly avoid diverging is to only have a >>>> >>>> single >>>> >>>>>>> copy of the spec, in our previous discussions the vast majority >>>> of >>>> >>>>> Apache >>>> >>>>>>> Iceberg community want it to exist here. 
>>>> >>>>>>>>>>>>>>> >>>> >>>>>>>>>>>>>>> On Wed, Aug 14, 2024 at 2:19 PM Daniel Weeks < >>>> >>>>> dwe...@apache.org >>>> >>>>>>> >>>> >>>>>>> wrote: >>>> >>>>>>>>>>>>>>>> >>>> >>>>>>>>>>>>>>>> I'm really excited about the introduction of variant >>>> type >>>> >>>> to >>>> >>>>>>> Iceberg, but I want to raise concerns about forking the spec. >>>> >>>>>>>>>>>>>>>> >>>> >>>>>>>>>>>>>>>> I feel like preemptively forking would create the >>>> situation >>>> >>>>>>> where we end up diverging because there's little reason to work >>>> with >>>> >>>>> both >>>> >>>>>>> communities to evolve in a way that benefits everyone. >>>> >>>>>>>>>>>>>>>> >>>> >>>>>>>>>>>>>>>> I would much rather point to a specific version of the >>>> spec >>>> >>>>> and >>>> >>>>>>> annotate any variance in Iceberg's handling. This would allow >>>> us to >>>> >>>>>>> continue without dividing the communities. >>>> >>>>>>>>>>>>>>>> >>>> >>>>>>>>>>>>>>>> If at any point there are irreconcilable differences, I >>>> >>>> would >>>> >>>>>>> support forking, but I don't feel like that should be the >>>> initial >>>> >>>> step. >>>> >>>>>>>>>>>>>>>> >>>> >>>>>>>>>>>>>>>> No one is excited about the possibility that the >>>> physical >>>> >>>>>>> representations end up diverging, but it feels like we're >>>> setting >>>> >>>>>> ourselves >>>> >>>>>>> up for that exact scenario. >>>> >>>>>>>>>>>>>>>> >>>> >>>>>>>>>>>>>>>> -Dan >>>> >>>>>>>>>>>>>>>> >>>> >>>>>>>>>>>>>>>> >>>> >>>>>>>>>>>>>>>> On Wed, Aug 14, 2024 at 6:54 AM Fokko Driesprong < >>>> >>>>>>> fo...@apache.org> wrote: >>>> >>>>>>>>>>>>>>>>> >>>> >>>>>>>>>>>>>>>>> +1 to what's already being said here. It is good to >>>> copy >>>> >>>> the >>>> >>>>>>> spec to Iceberg and add context that's specific to Iceberg, but >>>> at >>>> >>>> the >>>> >>>>>> same >>>> >>>>>>> time, we should maintain compatibility. 
>>>> >>>>>>>>>>>>>>>>> >>>> >>>>>>>>>>>>>>>>> Kind regards, >>>> >>>>>>>>>>>>>>>>> Fokko >>>> >>>>>>>>>>>>>>>>> >>>> >>>>>>>>>>>>>>>>> Op wo 14 aug 2024 om 15:30 schreef Manu Zhang < >>>> >>>>>>> owenzhang1...@gmail.com>: >>>> >>>>>>>>>>>>>>>>>> >>>> >>>>>>>>>>>>>>>>>> +1 to copy the spec into our repository. I think the >>>> best >>>> >>>>> way >>>> >>>>>>> to keep compatibility is building integration tests. >>>> >>>>>>>>>>>>>>>>>> >>>> >>>>>>>>>>>>>>>>>> Thanks, >>>> >>>>>>>>>>>>>>>>>> Manu >>>> >>>>>>>>>>>>>>>>>> >>>> >>>>>>>>>>>>>>>>>> On Wed, Aug 14, 2024 at 8:27 PM Péter Váry < >>>> >>>>>>> peter.vary.apa...@gmail.com> wrote: >>>> >>>>>>>>>>>>>>>>>>> >>>> >>>>>>>>>>>>>>>>>>> Thanks Russell and Aihua for pushing Variant >>>> support! >>>> >>>>>>>>>>>>>>>>>>> >>>> >>>>>>>>>>>>>>>>>>> Given the differences between the supported types >>>> and >>>> >>>> the >>>> >>>>>>> lack of interest from the other project, I think it is >>>> reasonable to >>>> >>>>>>> duplicate the specification to our repository. >>>> >>>>>>>>>>>>>>>>>>> I would give very strong emphasis on sticking to the >>>> >>>> Spark >>>> >>>>>>> spec as much as possible, to keep compatibility as much as >>>> possible. >>>> >>>>>> Maybe >>>> >>>>>>> even revert to a shared specification if the situation changes. >>>> >>>>>>>>>>>>>>>>>>> >>>> >>>>>>>>>>>>>>>>>>> Thanks, >>>> >>>>>>>>>>>>>>>>>>> Peter >>>> >>>>>>>>>>>>>>>>>>> >>>> >>>>>>>>>>>>>>>>>>> Aihua Xu <aihu...@gmail.com> ezt írta (időpont: >>>> 2024. >>>> >>>>> aug. >>>> >>>>>>> 13., K, 19:52): >>>> >>>>>>>>>>>>>>>>>>>> >>>> >>>>>>>>>>>>>>>>>>>> Thanks Russell for bringing this up. >>>> >>>>>>>>>>>>>>>>>>>> >>>> >>>>>>>>>>>>>>>>>>>> This is the main blocker to move forward with the >>>> >>>> Variant >>>> >>>>>>> support in Iceberg and hopefully we can have a consensus. 
To >>>> me, I >>>> >>>> also >>>> >>>>>>> feel it makes more sense to move the spec into Iceberg rather >>>> than >>>> >>>>> Spark >>>> >>>>>>> engine owns it and we try to keep it compatible with Spark spec. >>>> >>>>>>>>>>>>>>>>>>>> >>>> >>>>>>>>>>>>>>>>>>>> Thanks, >>>> >>>>>>>>>>>>>>>>>>>> Aihua >>>> >>>>>>>>>>>>>>>>>>>> >>>> >>>>>>>>>>>>>>>>>>>> On Mon, Aug 12, 2024 at 6:50 PM Russell Spitzer < >>>> >>>>>>> russell.spit...@gmail.com> wrote: >>>> >>>>>>>>>>>>>>>>>>>>> >>>> >>>>>>>>>>>>>>>>>>>>> Hi Y’all, >>>> >>>>>>>>>>>>>>>>>>>>> >>>> >>>>>>>>>>>>>>>>>>>>> We’ve hit a bit of a roadblock with the Variant >>>> >>>>> Proposal, >>>> >>>>>>> while we were hoping to move the Variant and Shredding >>>> specifications >>>> >>>>>> from >>>> >>>>>>> Spark into Iceberg there doesn’t seem to be a lot of interest in >>>> >>>> that. >>>> >>>>>>> Unfortunately, I think we have a number of issues with just >>>> linking >>>> >>>> to >>>> >>>>>> the >>>> >>>>>>> Spark project directly from within Iceberg and I believe we >>>> need to >>>> >>>>> copy >>>> >>>>>>> the specifications into our repository. >>>> >>>>>>>>>>>>>>>>>>>>> >>>> >>>>>>>>>>>>>>>>>>>>> There are a few reasons why i think this is >>>> necessary >>>> >>>>>>>>>>>>>>>>>>>>> >>>> >>>>>>>>>>>>>>>>>>>>> First, we have a divergence of types already. The >>>> >>>> Spark >>>> >>>>>>> Specification already includes types which Iceberg has no >>>> definition >>>> >>>>> for >>>> >>>>>>> (19, 20 - Interval Types) and Iceberg already has a type which >>>> is not >>>> >>>>>>> included within the Spark Specification (Time) and will soon >>>> have >>>> >>>> more >>>> >>>>>> with >>>> >>>>>>> TimestampNS, and Geo. >>>> >>>>>>>>>>>>>>>>>>>>> >>>> >>>>>>>>>>>>>>>>>>>>> Second, We would like to make sure that Spark is >>>> not a >>>> >>>>>> hard >>>> >>>>>>> dependency for other engines. 
We are working with several >>>> >>>> implementers >>>> >>>>> of >>>> >>>>>>> the Iceberg spec and it has previously been agreed that it >>>> would be >>>> >>>>> best >>>> >>>>>> if >>>> >>>>>>> the source of truth for Variant existed in an engine and file >>>> format >>>> >>>>>>> neutral location. The Iceberg project has a good open model of >>>> >>>>> governance >>>> >>>>>>> and, as we have seen so far discussing Variant, open and active >>>> >>>>>>> collaboration. This would also help as we can strictly version >>>> our >>>> >>>>>> changes >>>> >>>>>>> in-line with the rest of the Iceberg spec. >>>> >>>>>>>>>>>>>>>>>>>>> >>>> >>>>>>>>>>>>>>>>>>>>> Third, The Shredding spec is not quite finished >>>> and >>>> >>>>>>> requires some group analysis and discussion before we commit >>>> it. I >>>> >>>>> think >>>> >>>>>>> again the Iceberg community is probably the right place for >>>> this to >>>> >>>>>> happen >>>> >>>>>>> as we have already started discussions here on these topics. >>>> >>>>>>>>>>>>>>>>>>>>> >>>> >>>>>>>>>>>>>>>>>>>>> For these reasons I think we should go with a >>>> direct >>>> >>>>> copy >>>> >>>>>>> of the existing specification from the Spark Project and move >>>> ahead >>>> >>>>> with >>>> >>>>>>> our discussions and modifications within Iceberg. That said, I >>>> do not >>>> >>>>>> want >>>> >>>>>>> to diverge if possible from the Spark proposal. For example, >>>> although >>>> >>>>> we >>>> >>>>>> do >>>> >>>>>>> not use the Interval types above, I think we should not reuse >>>> those >>>> >>>>> type >>>> >>>>>>> ids within our spec. Iceberg's Variant Spec types 19 and 20 >>>> would >>>> >>>>> remain >>>> >>>>>>> unused along with any other types we think are not applicable. >>>> We >>>> >>>>> should >>>> >>>>>>> strive whenever possible to allow for compatibility. 
>>>> >>>>>>>>>>>>>>>>>>>>> >>>> >>>>>>>>>>>>>>>>>>>>> In the interest of moving forward with this >>>> proposal I >>>> >>>>> am >>>> >>>>>>> hoping to see if anyone in the community objects to this plan >>>> going >>>> >>>>>> forward >>>> >>>>>>> or has a better alternative. >>>> >>>>>>>>>>>>>>>>>>>>> >>>> >>>>>>>>>>>>>>>>>>>>> As always I am thankful for your time and am >>>> eager to >>>> >>>>> hear >>>> >>>>>>> back from everyone, >>>> >>>>>>>>>>>>>>>>>>>>> Russ >>>> >>>>>>>>>>>>>>>>>>>>> >>>> >>>>>>>>>>>>>>>>>>>>> >>>> >>>>>>> >>>> >>>>>> >>>> >>>>> >>>> >>>> >>>> >>> >>>> >> >>>> > >>>> >>>