Thank you Gang, that sounds like a good idea to me as well.

On Fri, Aug 23, 2024 at 8:47 AM Aihua Xu <aihua...@snowflake.com.invalid> wrote:
> Thanks Gang for initiating the discussion.
>
> On Fri, Aug 23, 2024 at 2:22 AM Gang Wu <ust...@gmail.com> wrote:
>
>> Thanks Aihua!
>>
>> I've started the discussion in dev@parquet:
>> https://lists.apache.org/thread/6h58hj39lhqtcyd2hlsyvqm4lzdh4b9z
>>
>> Best,
>> Gang
>>
>> On Fri, Aug 23, 2024 at 12:53 PM Aihua Xu <aihua...@snowflake.com> wrote:
>>
>>> From this thread
>>> https://lists.apache.org/thread/0k5oj3mn0049fcxoxm3gx3d7r28gw4rj,
>>> it seems the Spark community is leaning toward moving to Parquet.
>>>
>>> Gang, can you help start a discussion in the Parquet community on adopting and maintaining such a Variant spec?
>>>
>>> On Thu, Aug 22, 2024 at 8:08 AM Curt Hagenlocher <c...@hagenlocher.org> wrote:
>>>
>>>> This seems to straddle that line, in that you can also view this as a way to represent semi-structured data in a manner that allows for more efficient querying and computation by breaking out some of its components into a more structured form.
>>>>
>>>> (I also happen to want a canonical Arrow representation for variant data, as this type occurs in many databases but doesn't have a great representation today in ADBC results. That's why I filed [Format] Consider adding an official variant type to Arrow · Issue #42069 · apache/arrow (github.com) <https://github.com/apache/arrow/issues/42069>. Of course, there's no specific reason why a canonical Arrow representation for variants must align with Spark and/or Iceberg.)
>>>>
>>>> -Curt
>>>>
>>>> On Thu, Aug 22, 2024 at 2:01 AM Antoine Pitrou <anto...@python.org> wrote:
>>>>
>>>>>
>>>>> Ah, thanks. I've tried to find a rationale and ended up on https://lists.apache.org/thread/xnyo1k66dxh0ffpg7j9f04xgos0kwc34 . Is it a good description of what you're after?
>>>>>
>>>>> If so, then I don't think Arrow is a good match. This seems mostly to be a marshalling format for semi-structured data (like Avro?).
>>>>> Arrow data types are meant to be in a representation ideal for querying and computation, rather than transport and storage.
>>>>>
>>>>> This could be developed separately and then be represented in Arrow using an extension type (perhaps a canonical one as in https://arrow.apache.org/docs/dev/format/CanonicalExtensions.html).
>>>>>
>>>>> What do other Arrow developers think?
>>>>>
>>>>> Regards
>>>>>
>>>>> Antoine.
>>>>>
>>>>>
>>>>> On 22/08/2024 at 10:45, Gang Wu wrote:
>>>>> > Sorry for the inconvenience.
>>>>> >
>>>>> > This is the permalink for the discussion:
>>>>> > https://lists.apache.org/thread/hopkr2f0ftoywwt9zo3jxb7n0ob5s5bw
>>>>> >
>>>>> > On Thu, Aug 22, 2024 at 3:51 PM Antoine Pitrou <anto...@python.org> wrote:
>>>>> >
>>>>> >>
>>>>> >> Hi Gang,
>>>>> >>
>>>>> >> Sorry, but can you give a pointer to the start of this discussion thread in a readable format (for example a mailing-list archive)? It appears that dev@arrow wasn't cc'ed from the start and that can make it difficult to understand what this is about.
>>>>> >>
>>>>> >> Regards
>>>>> >>
>>>>> >> Antoine.
>>>>> >>
>>>>> >>
>>>>> >> On 22/08/2024 at 08:32, Gang Wu wrote:
>>>>> >>> It seems that we have reached a consensus to some extent that there should be a new home for the variant spec. The pending question is whether Parquet or Arrow is a better choice. As a committer from Arrow, Parquet and ORC communities, I am neutral to choose any and happy to help with the movement once a decision has been made.
>>>>> >>>
>>>>> >>> Should we start a vote to move forward?
>>>>> >>>
>>>>> >>> Best,
>>>>> >>> Gang
>>>>> >>>
>>>>> >>> On Sat, Aug 17, 2024 at 8:34 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
>>>>> >>>
>>>>> >>>>>
>>>>> >>>>> That being said, I think the most important consideration for now is where are the current maintainers / contributors to the variant type. If most of them are already PMC members / committers on a project, it becomes a bit easier. Otherwise if there isn't much overlap with a project's existing governance, I worry there could be a bit of friction. How many active contributors are there from Iceberg? And how about from Arrow?
>>>>> >>>>
>>>>> >>>>
>>>>> >>>> I think this is the key question. What are the requirements around governance? I've seen some tangential messaging here but I'm not clear on what everyone expects.
>>>>> >>>>
>>>>> >>>> I think for a lot of the other concerns my view is that the exact project does not really matter (and choosing a project with mature cross-language testing infrastructure or committing to building it is critical). IIUC we are talking about the following artifacts:
>>>>> >>>>
>>>>> >>>> 1. A standalone specification document (this can be hosted anyplace)
>>>>> >>>> 2. A set of language bindings with minimal dependencies that can be consumed downstream (again, as long as dependencies are managed carefully any project can host these)
>>>>> >>>> 3. Potential integration where appropriate into file format libraries to support shredding (but as of now this is being bypassed by using conventions anyways).
>>>>> >>>> My impression is that at least for Parquet there has been a proliferation of vectorized readers across different projects, so I'm not clear how much standardization in parquet-java could help here.
>>>>> >>>>
>>>>> >>>> To respond to some other questions:
>>>>> >>>>
>>>>> >>>>> Arrow is not used as Spark's in-memory model, nor Trino and others so those existing relationships aren't there. I also worry that differences in approaches would make it difficult later on.
>>>>> >>>>
>>>>> >>>>
>>>>> >>>> While Arrow is not in the core memory model, for Spark I believe it is still used for IPC for things like Java<->Python. Trino also consumes Arrow libraries today to support things like Snowflake/BigQuery federation. But I think this is minor because as mentioned above I think the functional libraries would be relatively stand-alone.
>>>>> >>>>
>>>>> >>>> Do we think it could be introduced as a canonical extension arrow type?
>>>>> >>>>
>>>>> >>>>
>>>>> >>>> I believe it can be, I think there are probably different layouts that can be supported:
>>>>> >>>>
>>>>> >>>> 1. A struct with two variable width bytes columns (metadata and value data are stored separately and each entry has a 1:1 relationship).
>>>>> >>>> 2. Shredded (shredded according to the same convention as parquet), I would need to double check but I don't think Arrow would have problems here but REE would likely be required to make this efficient (i.e. sparse value support is important).
>>>>> >>>>
>>>>> >>>> In both cases the main complexity is providing the necessary functions for manipulation.
>>>>> >>>>
>>>>> >>>> Thanks,
>>>>> >>>> Micah
>>>>> >>>>
>>>>> >>>> On Fri, Aug 16, 2024 at 3:58 PM Will Jones <will.jones...@gmail.com> wrote:
>>>>> >>>>
>>>>> >>>>> In being more engine and format agnostic, I agree the Arrow project might be a good host for such a specification. It seems like we want to move away from hosting in Spark to make it engine agnostic. But moving into Iceberg might make it less format agnostic, as I understand multiple formats might want to implement this. I'm not intimately familiar with the state of this, but I believe Delta Lake would like to be aligned with the same format as Iceberg. In addition, the Lance format (which I work on), will eventually be interesting as well. It seems equally bad to me to attach this specification to a particular table format as it does a particular query engine.
>>>>> >>>>>
>>>>> >>>>> That being said, I think the most important consideration for now is where are the current maintainers / contributors to the variant type. If most of them are already PMC members / committers on a project, it becomes a bit easier. Otherwise if there isn't much overlap with a project's existing governance, I worry there could be a bit of friction. How many active contributors are there from Iceberg? And how about from Arrow?
>>>>> >>>>>
>>>>> >>>>> BTW, I'd add I'm interested in helping develop an Arrow extension type for the binary variant type.
>>>>> >>>>> I've been experimenting with a DataFusion extension that operates on this [1], and already have some ideas on how such an extension type might be defined. I'm not yet caught up on the shredded specification, but I think having just the binary format would be beneficial for in-memory analytics, which are most relevant to Arrow. I'll be creating a separate thread on the Arrow ML about this soon.
>>>>> >>>>>
>>>>> >>>>> Best,
>>>>> >>>>>
>>>>> >>>>> Will Jones
>>>>> >>>>>
>>>>> >>>>> [1] https://github.com/datafusion-contrib/datafusion-functions-variant/issues
>>>>> >>>>>
>>>>> >>>>> On Thu, Aug 15, 2024 at 7:39 PM Gang Wu <ust...@gmail.com> wrote:
>>>>> >>>>>
>>>>> >>>>>> + dev@arrow
>>>>> >>>>>>
>>>>> >>>>>> Thanks for all the valuable suggestions! I am inclined to Micah's idea that Arrow might be a better host compared to Parquet.
>>>>> >>>>>>
>>>>> >>>>>> To give more context, I am taking the initiative to add the geometry type to both Parquet and ORC. I'd like to do the same thing for variant type in that variant type is engine and file format agnostic. This does mean that Parquet might not be the neutral place to hold the variant spec.
>>>>> >>>>>>
>>>>> >>>>>> Best,
>>>>> >>>>>> Gang
>>>>> >>>>>>
>>>>> >>>>>> On Fri, Aug 16, 2024 at 10:00 AM Jingsong Li <jingsongl...@gmail.com> wrote:
>>>>> >>>>>>
>>>>> >>>>>>> Thanks all for your discussion.
>>>>> >>>>>>>
>>>>> >>>>>>> The Apache Paimon community is also considering support for this Variant type, without a doubt, we hope to maintain consistency with Iceberg.
>>>>> >>>>>>>
>>>>> >>>>>>> Not only the Paimon community, but also various computing engines need to adapt to this type, such as Flink and StarRocks. We also hope to promote them to adapt to this type.
>>>>> >>>>>>>
>>>>> >>>>>>> It is worth noting that we also need to standardize many functions related to it.
>>>>> >>>>>>>
>>>>> >>>>>>> A neutral place to maintain it is a great choice.
>>>>> >>>>>>>
>>>>> >>>>>>> - As Gang Wu said, a standalone project is good, just like RoaringBitmap [1].
>>>>> >>>>>>> - As Ryan said, the Parquet community is a neutral option too.
>>>>> >>>>>>> - As Micah said, Arrow is an option too.
>>>>> >>>>>>>
>>>>> >>>>>>> [1] https://github.com/RoaringBitmap
>>>>> >>>>>>>
>>>>> >>>>>>> Best,
>>>>> >>>>>>> Jingsong
>>>>> >>>>>>>
>>>>> >>>>>>> On Fri, Aug 16, 2024 at 7:18 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
>>>>> >>>>>>>>>
>>>>> >>>>>>>>> That's fair @Micah, so far all the discussions have been direct and off the dev list. Would you like to make the request on the public Spark Dev list? I would be glad to co-sign, I can also draft up a quick email if you don't have time.
>>>>> >>>>>>>>
>>>>> >>>>>>>>
>>>>> >>>>>>>> I think once we come to consensus, if you have bandwidth, I think the message might be better coming from you, as you have more context on some of the non-public conversations, the requirements from an Iceberg perspective on governance and the blockers that were encountered. If details on the conversations can't be shared, (i.e.
>>>>> >>>>>>>> we are starting from scratch) it seems like suggesting a new project via SPIP might be the way forward. I'm happy to help with that if it is useful but I would guess Aihua or Tyler might be in a better place to start as it seems they have done more serious thinking here.
>>>>> >>>>>>>>
>>>>> >>>>>>>> If we decide to try to standardize on Parquet or Arrow I'm happy to help support the effort in those communities.
>>>>> >>>>>>>>
>>>>> >>>>>>>> Thanks,
>>>>> >>>>>>>> Micah
>>>>> >>>>>>>>
>>>>> >>>>>>>> On Thu, Aug 15, 2024 at 8:09 AM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>>>> >>>>>>>>>
>>>>> >>>>>>>>> That's fair @Micah, so far all the discussions have been direct and off the dev list. Would you like to make the request on the public Spark Dev list? I would be glad to co-sign, I can also draft up a quick email if you don't have time.
>>>>> >>>>>>>>>
>>>>> >>>>>>>>> On Thu, Aug 15, 2024 at 10:04 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
>>>>> >>>>>>>>>>>
>>>>> >>>>>>>>>>> I agree that it would be beneficial to make a sub-project, the main problem is political and not logistic. I've been asking for movement from other relative projects for a month and we simply haven't gotten anywhere.
>>>>> >>>>>>>>>>
>>>>> >>>>>>>>>>
>>>>> >>>>>>>>>> I just wanted to double check that these issues were brought directly to the Spark community (i.e. a discussion thread on the Spark developer mailing list) and not via backchannels.
>>>>> >>>>>>>>>>
>>>>> >>>>>>>>>> I'm not sure the outcome would be different and I don't think this should block forking the spec, but we should make sure that the decision is publicly documented within both communities.
>>>>> >>>>>>>>>>
>>>>> >>>>>>>>>> Thanks,
>>>>> >>>>>>>>>> Micah
>>>>> >>>>>>>>>>
>>>>> >>>>>>>>>> On Thu, Aug 15, 2024 at 7:47 AM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>>>> >>>>>>>>>>>
>>>>> >>>>>>>>>>> @Gang Wu
>>>>> >>>>>>>>>>>
>>>>> >>>>>>>>>>> I agree that it would be beneficial to make a sub-project, the main problem is political and not logistic. I've been asking for movement from other relative projects for a month and we simply haven't gotten anywhere. I don't think there is anything that would stop us from moving to a joint project in the future and if you know of some way of encouraging that movement from other relevant parties I would be glad to collaborate in doing that. One thing that I don't want to do is have the Iceberg project stay in a holding pattern without any clear roadmap as to how to proceed.
>>>>> >>>>>>>>>>>
>>>>> >>>>>>>>>>> On Wed, Aug 14, 2024 at 11:12 PM Yufei Gu <flyrain...@gmail.com> wrote:
>>>>> >>>>>>>>>>>>
>>>>> >>>>>>>>>>>> I’m on board with copying the spec into our repository. However, as we’ve talked about, it’s not just a straightforward copy; there are already some divergences. Some of them are under discussion. Iceberg is definitely the best place for these specs.
>>>>> >>>>>>>>>>>> Engines like Trino and Flink can then rely on the Iceberg specs as a solid foundation.
>>>>> >>>>>>>>>>>>
>>>>> >>>>>>>>>>>> Yufei
>>>>> >>>>>>>>>>>>
>>>>> >>>>>>>>>>>> On Wed, Aug 14, 2024 at 7:51 PM Gang Wu <ust...@gmail.com> wrote:
>>>>> >>>>>>>>>>>>>
>>>>> >>>>>>>>>>>>> Sorry for chiming in late.
>>>>> >>>>>>>>>>>>>
>>>>> >>>>>>>>>>>>> From the discussion in https://lists.apache.org/thread/xcyytoypgplfr74klg1z2rgjo6k5b0sq, I don't quite understand why it is logistically complicated to create a sub-project to hold the variant spec and impl.
>>>>> >>>>>>>>>>>>>
>>>>> >>>>>>>>>>>>> IMHO, copying the variant type spec into Apache Iceberg has some deficiencies:
>>>>> >>>>>>>>>>>>> - It is a burden to update two repos if there is a variant type spec change and will likely result in deviation if some changes do not reach agreement from both parties.
>>>>> >>>>>>>>>>>>> - Implementers are required to keep an eye on both specs (considering proprietary engines where both Iceberg and Delta are supported).
>>>>> >>>>>>>>>>>>> - Putting the spec and impl of variant type in Iceberg repo does lose the opportunity for better native support from file formats like Parquet and ORC.
>>>>> >>>>>>>>>>>>>
>>>>> >>>>>>>>>>>>> I'm not sure if it is possible to create a separate project (e.g. apache/variant-type) to make it a single point of truth. We can learn from the experience of Apache Arrow.
>>>>> >>>>>>>>>>>>> In this fashion, different engines, table formats and file formats can follow the same spec and are free to depend on the reference implementations from apache/variant-type or implement their own.
>>>>> >>>>>>>>>>>>>
>>>>> >>>>>>>>>>>>> Best,
>>>>> >>>>>>>>>>>>> Gang
>>>>> >>>>>>>>>>>>>
>>>>> >>>>>>>>>>>>> On Thu, Aug 15, 2024 at 10:07 AM Jack Ye <yezhao...@gmail.com> wrote:
>>>>> >>>>>>>>>>>>>>
>>>>> >>>>>>>>>>>>>> +1 for copying the spec into our repository, I think we need to own it fully as a part of the table spec, and we can build compatibility through tests.
>>>>> >>>>>>>>>>>>>>
>>>>> >>>>>>>>>>>>>> -Jack
>>>>> >>>>>>>>>>>>>>
>>>>> >>>>>>>>>>>>>> On Wed, Aug 14, 2024 at 12:52 PM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>>>> >>>>>>>>>>>>>>>
>>>>> >>>>>>>>>>>>>>> I'm not really in favor of linking and annotating as that just makes things more complicated and still is essentially forking just with more steps. If we just track our annotations / modifications to a single commit/version then we have the same issue again but now you have to go to multiple sources to get the actual Spec. In addition, our very copy of the Spec is going to require new types which don't exist in the Spark Spec which necessarily means diverging.
>>>>> >>>>>>>>>>>>>>> We will need to take up new primitive IDs (as noted in my first email).
>>>>> >>>>>>>>>>>>>>>
>>>>> >>>>>>>>>>>>>>> The other issue I have is I don't think the Spark Spec is really going through a thorough review process from all members of the Spark community, I believe it probably should have gone through the SPIP but instead seems to have been merged without broad community involvement.
>>>>> >>>>>>>>>>>>>>>
>>>>> >>>>>>>>>>>>>>> The only way to truly avoid diverging is to only have a single copy of the spec, in our previous discussions the vast majority of the Apache Iceberg community wants it to exist here.
>>>>> >>>>>>>>>>>>>>>
>>>>> >>>>>>>>>>>>>>> On Wed, Aug 14, 2024 at 2:19 PM Daniel Weeks <dwe...@apache.org> wrote:
>>>>> >>>>>>>>>>>>>>>>
>>>>> >>>>>>>>>>>>>>>> I'm really excited about the introduction of variant type to Iceberg, but I want to raise concerns about forking the spec.
>>>>> >>>>>>>>>>>>>>>>
>>>>> >>>>>>>>>>>>>>>> I feel like preemptively forking would create the situation where we end up diverging because there's little reason to work with both communities to evolve in a way that benefits everyone.
>>>>> >>>>>>>>>>>>>>>>
>>>>> >>>>>>>>>>>>>>>> I would much rather point to a specific version of the spec and annotate any variance in Iceberg's handling. This would allow us to continue without dividing the communities.
>>>>> >>>>>>>>>>>>>>>>
>>>>> >>>>>>>>>>>>>>>> If at any point there are irreconcilable differences, I would support forking, but I don't feel like that should be the initial step.
>>>>> >>>>>>>>>>>>>>>>
>>>>> >>>>>>>>>>>>>>>> No one is excited about the possibility that the physical representations end up diverging, but it feels like we're setting ourselves up for that exact scenario.
>>>>> >>>>>>>>>>>>>>>>
>>>>> >>>>>>>>>>>>>>>> -Dan
>>>>> >>>>>>>>>>>>>>>>
>>>>> >>>>>>>>>>>>>>>> On Wed, Aug 14, 2024 at 6:54 AM Fokko Driesprong <fo...@apache.org> wrote:
>>>>> >>>>>>>>>>>>>>>>>
>>>>> >>>>>>>>>>>>>>>>> +1 to what's already being said here. It is good to copy the spec to Iceberg and add context that's specific to Iceberg, but at the same time, we should maintain compatibility.
>>>>> >>>>>>>>>>>>>>>>>
>>>>> >>>>>>>>>>>>>>>>> Kind regards,
>>>>> >>>>>>>>>>>>>>>>> Fokko
>>>>> >>>>>>>>>>>>>>>>>
>>>>> >>>>>>>>>>>>>>>>> On Wed, Aug 14, 2024 at 15:30, Manu Zhang <owenzhang1...@gmail.com> wrote:
>>>>> >>>>>>>>>>>>>>>>>>
>>>>> >>>>>>>>>>>>>>>>>> +1 to copy the spec into our repository. I think the best way to keep compatibility is building integration tests.
>>>>> >>>>>>>>>>>>>>>>>>
>>>>> >>>>>>>>>>>>>>>>>> Thanks,
>>>>> >>>>>>>>>>>>>>>>>> Manu
>>>>> >>>>>>>>>>>>>>>>>>
>>>>> >>>>>>>>>>>>>>>>>> On Wed, Aug 14, 2024 at 8:27 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>>>> >>>>>>>>>>>>>>>>>>>
>>>>> >>>>>>>>>>>>>>>>>>> Thanks Russell and Aihua for pushing Variant support!
>>>>> >>>>>>>>>>>>>>>>>>>
>>>>> >>>>>>>>>>>>>>>>>>> Given the differences between the supported types and the lack of interest from the other project, I think it is reasonable to duplicate the specification to our repository.
>>>>> >>>>>>>>>>>>>>>>>>> I would give very strong emphasis on sticking to the Spark spec as much as possible, to keep compatibility as much as possible. Maybe even revert to a shared specification if the situation changes.
>>>>> >>>>>>>>>>>>>>>>>>>
>>>>> >>>>>>>>>>>>>>>>>>> Thanks,
>>>>> >>>>>>>>>>>>>>>>>>> Peter
>>>>> >>>>>>>>>>>>>>>>>>>
>>>>> >>>>>>>>>>>>>>>>>>> On Tue, Aug 13, 2024 at 19:52, Aihua Xu <aihu...@gmail.com> wrote:
>>>>> >>>>>>>>>>>>>>>>>>>>
>>>>> >>>>>>>>>>>>>>>>>>>> Thanks Russell for bringing this up.
>>>>> >>>>>>>>>>>>>>>>>>>>
>>>>> >>>>>>>>>>>>>>>>>>>> This is the main blocker to move forward with the Variant support in Iceberg and hopefully we can have a consensus. To me, it also makes more sense to move the spec into Iceberg rather than having the Spark engine own it while we try to keep it compatible with the Spark spec.
>>>>> >>>>>>>>>>>>>>>>>>>>
>>>>> >>>>>>>>>>>>>>>>>>>> Thanks,
>>>>> >>>>>>>>>>>>>>>>>>>> Aihua
>>>>> >>>>>>>>>>>>>>>>>>>>
>>>>> >>>>>>>>>>>>>>>>>>>> On Mon, Aug 12, 2024 at 6:50 PM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>>>> >>>>>>>>>>>>>>>>>>>>>
>>>>> >>>>>>>>>>>>>>>>>>>>> Hi Y’all,
>>>>> >>>>>>>>>>>>>>>>>>>>>
>>>>> >>>>>>>>>>>>>>>>>>>>> We’ve hit a bit of a roadblock with the Variant Proposal, while we were hoping to move the Variant and Shredding specifications from Spark into Iceberg there doesn’t seem to be a lot of interest in that. Unfortunately, I think we have a number of issues with just linking to the Spark project directly from within Iceberg and I believe we need to copy the specifications into our repository.
>>>>> >>>>>>>>>>>>>>>>>>>>>
>>>>> >>>>>>>>>>>>>>>>>>>>> There are a few reasons why I think this is necessary:
>>>>> >>>>>>>>>>>>>>>>>>>>>
>>>>> >>>>>>>>>>>>>>>>>>>>> First, we have a divergence of types already. The Spark Specification already includes types which Iceberg has no definition for (19, 20 - Interval Types) and Iceberg already has a type which is not included within the Spark Specification (Time) and will soon have more with TimestampNS and Geo.
>>>>> >>>>>>>>>>>>>>>>>>>>>
>>>>> >>>>>>>>>>>>>>>>>>>>> Second, we would like to make sure that Spark is not a hard dependency for other engines. We are working with several implementers of the Iceberg spec and it has previously been agreed that it would be best if the source of truth for Variant existed in an engine and file format neutral location. The Iceberg project has a good open model of governance and, as we have seen so far discussing Variant, open and active collaboration. This would also help as we can strictly version our changes in line with the rest of the Iceberg spec.
>>>>> >>>>>>>>>>>>>>>>>>>>>
>>>>> >>>>>>>>>>>>>>>>>>>>> Third, the Shredding spec is not quite finished and requires some group analysis and discussion before we commit it. I think again the Iceberg community is probably the right place for this to happen as we have already started discussions here on these topics.
>>>>> >>>>>>>>>>>>>>>>>>>>>
>>>>> >>>>>>>>>>>>>>>>>>>>> For these reasons I think we should go with a direct copy of the existing specification from the Spark Project and move ahead with our discussions and modifications within Iceberg. That said, I do not want to diverge if possible from the Spark proposal. For example, although we do not use the Interval types above, I think we should not reuse those type ids within our spec. Iceberg's Variant Spec types 19 and 20 would remain unused along with any other types we think are not applicable. We should strive whenever possible to allow for compatibility.
>>>>> >>>>>>>>>>>>>>>>>>>>>
>>>>> >>>>>>>>>>>>>>>>>>>>> In the interest of moving forward with this proposal I am hoping to see if anyone in the community objects to this plan going forward or has a better alternative.
>>>>> >>>>>>>>>>>>>>>>>>>>>
>>>>> >>>>>>>>>>>>>>>>>>>>> As always I am thankful for your time and am eager to hear back from everyone,
>>>>> >>>>>>>>>>>>>>>>>>>>> Russ