> We should probably add a section to our Flink docs that explains and links to Flink’s support policy and has a table of Iceberg versions that work with Flink versions. (We should probably have the same table for Spark, too!)
Thanks Ryan for the suggestion, I created a separate issue to address this earlier: https://github.com/apache/iceberg/issues/3115 . I will move this forward. On Thu, Oct 7, 2021 at 1:55 PM Jack Ye <yezhao...@gmail.com> wrote: > Hi everyone, > I tried to prototype option 3, here is the PR: https://github.com/apache/iceberg/pull/3237 > Sorry I did not see that Anton is planning to do it, but anyway it's just a draft, so feel free to just use it as reference. > Best, > Jack Ye > On Sun, Oct 3, 2021 at 2:19 PM Ryan Blue <b...@tabular.io> wrote: >> Thanks for the context on the Flink side! I think it sounds reasonable to keep up to date with the latest supported Flink version. If we want, we could later go with something similar to what we do for Spark, but we'll see how it goes and what the Flink community needs. We should probably add a section to our Flink docs that explains and links to Flink's support policy and has a table of Iceberg versions that work with Flink versions. (We should probably have the same table for Spark, too!) >> For Spark, I'm also leaning toward the modified option 3 where we keep all of the code in the main repository but only build with one module at a time by default. It makes sense to switch based on modules — rather than selecting src paths within a module — so that it is easy to run a build with all modules if you choose to — for example, when building release binaries. >> The reason I think we should go with option 3 is for testing. If we have a single repo with api, core, etc. and spark, then changes to the common modules can be tested by CI actions. Updates to individual Spark modules would be completely independent. There is a slight inconvenience that when an API used by Spark changes, the author would still need to fix multiple Spark versions. But the trade-off is that with a separate repository like option 2, changes that break Spark versions are not caught, and then the Spark repository's CI ends up failing on completely unrelated changes. That would be a major pain, felt by everyone contributing to the Spark integration, so I think option 3 is the best path forward. >> It sounds like we probably have some agreement now, but please speak up if you think another option would be better. >> The next step is to prototype the build changes to test out option 3. Or if you prefer option 2, then prototype those changes as well. I think that Anton is planning to do this, but if you have time and the desire to do it, please reach out and coordinate with us! >> Ryan >> On Wed, Sep 29, 2021 at 9:12 PM Steven Wu <stevenz...@gmail.com> wrote: >>> Wing, sorry, my earlier message probably misled you. I was speaking my personal opinion on Flink version support. >>> On Tue, Sep 28, 2021 at 8:03 PM Wing Yew Poon <wyp...@cloudera.com.invalid> wrote: >>>> Hi OpenInx, I'm sorry I misunderstood the thinking of the Flink community. Thanks for the clarification. - Wing Yew >>>> On Tue, Sep 28, 2021 at 7:15 PM OpenInx <open...@gmail.com> wrote: >>>>> Hi Wing >>>>> As we discussed above, we as a community prefer option 2 or option 3. So in fact, when we planned to upgrade the Flink version from 1.12 to 1.13, we did our best to guarantee that the master Iceberg repo could work fine for both Flink 1.12 and Flink 1.13.
For more context, please see [1], [2], [3]. >>>>> [1] https://github.com/apache/iceberg/pull/3116 >>>>> [2] https://github.com/apache/iceberg/issues/3183 >>>>> [3] https://lists.apache.org/x/thread.html/ra438e89eeec2d4623a32822e21739c8f2229505522d73d1034e34198@%3Cdev.flink.apache.org%3E >>>>> On Wed, Sep 29, 2021 at 5:27 AM Wing Yew Poon <wyp...@cloudera.com.invalid> wrote: >>>>>> In the last community sync, we spent a little time on this topic. For Spark support, there are currently two options under consideration: >>>>>> Option 2: Separate repo for the Spark support. Use branches for supporting different Spark versions. Main branch for the latest Spark version (3.2 to begin with). Tooling needs to be built for producing regular snapshots of core Iceberg in a consumable way for this repo. It is unclear if commits to core Iceberg will be tested pre-commit against Spark support; my impression is that they will not be, and the Spark support build can be broken by changes to core. >>>>>> A variant of option 3 (which we will simply call Option 3 going forward): Single repo, separate module (subdirectory) for each Spark version to be supported. Code duplication in each Spark module (no attempt to refactor out common code). Each module built against the specific version of Spark to be supported, producing a runtime jar built against that version. CI will test all modules. Support can be provided for only building the modules a developer cares about. >>>>>> More input was sought and people are encouraged to voice their preference. I lean towards Option 3. >>>>>> - Wing Yew >>>>>> ps. In the sync, as Steven Wu wrote, the question was raised whether the same multi-version support strategy can be adopted across engines. Based on what Steven wrote, currently the Flink developer community's bandwidth makes supporting only a single Flink version (and focusing resources on developing new features on that version) the preferred choice. If so, then no multi-version support strategy for Flink is needed at this time. >>>>>> On Thu, Sep 23, 2021 at 5:26 PM Steven Wu <stevenz...@gmail.com> wrote: >>>>>>> During the sync meeting, people talked about if and how we can have the same version support model across engines like Flink and Spark. I can provide some input from the Flink side. >>>>>>> Flink only supports two minor versions. E.g., right now Flink 1.13 is the latest released version. That means only Flink 1.12 and 1.13 are supported. Feature changes or bug fixes will only be backported to 1.12 and 1.13, unless it is a serious bug (like security). With that context, personally I like option 1 (with one actively supported Flink version in the master branch) for the iceberg-flink module. >>>>>>> We discussed the idea of supporting multiple Flink versions via a shim layer and multiple modules. While it may be a little better to support multiple Flink versions, I don't know if there is enough support and resources from the community to pull it off. There is also the ongoing maintenance burden for each minor version release from Flink, which happens roughly every 4 months.
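To make the Option 3 variant above concrete: the per-version modules could be selected with a Gradle project property in the root settings script, so that only the latest Spark version is built by default while a full build (for example, for release binaries) can still include every version. The property name, default version, module names, and directory layout below are illustrative assumptions, not the project's actual build configuration:

```kotlin
// settings.gradle.kts -- a minimal sketch of property-driven module selection, assuming a
// hypothetical "sparkVersions" property and a per-version source layout like spark/v3.2/spark.
import java.io.File

rootProject.name = "iceberg"

// Core modules are always part of the build.
include("iceberg-api", "iceberg-core", "iceberg-data")

// Only the latest supported Spark version is built by default; a developer or a release
// build can override it, e.g.: ./gradlew build -PsparkVersions=2.4,3.1,3.2
val sparkVersions: List<String> =
    (startParameter.projectProperties["sparkVersions"] ?: "3.2")
        .split(",").map { it.trim() }.filter { it.isNotEmpty() }

for (v in sparkVersions) {
    for (module in listOf("spark", "spark-extensions", "spark-runtime")) {
        val path = ":iceberg-spark:iceberg-$module-$v"
        include(path)
        // Each version keeps its own full copy of the sources instead of sharing a common
        // Spark module that is compiled against only one Spark release.
        project(path).projectDir = File(rootDir, "spark/v$v/$module")
    }
}
```

A release build would pass the full list so that runtime jars for every supported version come out of a single invocation, which matches the "easy to run a build with all modules if you choose to" point above.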
>>>>>>> On Thu, Sep 16, 2021 at 10:25 PM Peter Vary <pv...@cloudera.com.invalid> wrote: >>>>>>>> Since you mentioned Hive, I'll chime in with what we do there. You might find it useful: >>>>>>>> - metastore module - only small differences - DynConstructor solves it for us >>>>>>>> - mr module - some bigger differences, but still manageable for Hive 2-3. Need some new classes, but most of the code is reused - extra module for Hive 3. For Hive 4 we use a different repo as we moved to the Hive codebase. >>>>>>>> My thoughts based on the above experience: >>>>>>>> - Keeping Hive 4 and Hive 2-3 code in sync is a pain. We constantly have problems with backporting changes between repos and we are lagging behind, which hurts both projects. >>>>>>>> - The Hive 2-3 model works better by forcing us to keep things in sync, but with serious differences in the Hive project it still doesn't seem like a viable option. >>>>>>>> So I think the question is: how stable is the Spark code we are integrating with? If it is fairly stable, then we are better off with a "one repo, multiple modules" approach and we should consider the multirepo only if the differences become prohibitive. >>>>>>>> Thanks, Peter >>>>>>>> On Fri, 17 Sep 2021, 02:21 Anton Okolnychyi, <aokolnyc...@apple.com.invalid> wrote: >>>>>>>>> Okay, looks like there is consensus around supporting multiple Spark versions at the same time. There are folks who mentioned this on this thread and there were folks who brought this up during the sync. >>>>>>>>> Let's think through Option 2 and 3 in more detail then. >>>>>>>>> Option 2 >>>>>>>>> In Option 2, there will be a separate repo. I believe the master branch will soon point to Spark 3.2 (the most recent supported version). The main development will happen there and the artifact version will be 0.1.0. I also suppose there will be 0.1.x-spark-2 and 0.1.x-spark-3.1 branches where we will cherry-pick applicable changes. Once we are ready to release the 0.1.0 Spark integration, we will create 0.1.x-spark-3.2 and cut 3 releases: Spark 2.4, Spark 3.1, Spark 3.2. After that, we will bump the version in master to 0.2.0 and create new 0.2.x-spark-2 and 0.2.x-spark-3.1 branches for cherry-picks. >>>>>>>>> I guess we will continue to shade everything in the new repo and will have to release every time the core is released. We will do a maintenance release for each supported Spark version whenever we cut a new maintenance Iceberg release or need to fix any bugs in the Spark integration. Under this model, we will probably need nightly snapshots (or on each commit) for the core format and the Spark integration will depend on snapshots until we are ready to release. >>>>>>>>> Overall, I think this option gives us very simple builds and provides the best separation. It will keep the main repo clean. The main downside is that we will have to split a Spark feature into two PRs: one against the core and one against the Spark integration.
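For the snapshot-consumption part of Option 2, the separate Spark repository would resolve nightly (or per-commit) core artifacts from a snapshot repository until the next core release. A rough sketch of that dependency wiring, with purely illustrative versions and assuming core snapshots were published to the ASF snapshots repository:

```kotlin
// build.gradle.kts in the hypothetical separate iceberg-spark repository -- a sketch only.
plugins {
    `java-library`
}

repositories {
    mavenCentral()
    // Core snapshots would have to be published somewhere resolvable by this repo,
    // e.g. the ASF snapshots repository (an assumption for illustration).
    maven { url = uri("https://repository.apache.org/content/repositories/snapshots/") }
}

// Illustrative versions: track an unreleased core snapshot during development,
// then pin the released core version when cutting a Spark integration release.
val icebergVersion = "0.13.0-SNAPSHOT"
val sparkVersion = "3.2.0"

dependencies {
    api("org.apache.iceberg:iceberg-api:$icebergVersion")
    implementation("org.apache.iceberg:iceberg-core:$icebergVersion")
    // Spark stays compileOnly so the integration does not drag Spark into its runtime jar.
    compileOnly("org.apache.spark:spark-sql_2.12:$sparkVersion")

    testImplementation("org.apache.spark:spark-sql_2.12:$sparkVersion")
}
```

Whether those core snapshots are published nightly or on every commit only changes how fresh the -SNAPSHOT dependency is; the wiring stays the same.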
Certain >>>>>>>>> changes in >>>>>>>>> core can also break the Spark integration too and will require >>>>>>>>> adaptations. >>>>>>>>> >>>>>>>>> Ryan, I am not sure I fully understood the testing part. How will >>>>>>>>> we be able to test the Spark integration in the main repo if certain >>>>>>>>> changes in core may break the Spark integration and require changes >>>>>>>>> there? >>>>>>>>> Will we try to prohibit such changes? >>>>>>>>> >>>>>>>>> Option 3 (modified) >>>>>>>>> >>>>>>>>> If I get correctly, the modified Option 3 sounds very close to >>>>>>>>> the initially suggested approach by Imran but with code duplication >>>>>>>>> instead >>>>>>>>> of extra refactoring and introducing new common modules. >>>>>>>>> >>>>>>>>> Jack, are you suggesting we test only a single Spark version at a >>>>>>>>> time? Or do we expect to test all versions? Will there be any >>>>>>>>> difference >>>>>>>>> compared to just having a module per version? I did not fully >>>>>>>>> understand. >>>>>>>>> >>>>>>>>> My worry with this approach is that our build will be very >>>>>>>>> complicated and we will still have a lot of Spark-related modules in >>>>>>>>> the >>>>>>>>> main repo. Once people start using Flink and Hive more, will we have >>>>>>>>> to do >>>>>>>>> the same? >>>>>>>>> >>>>>>>>> - Anton >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> On 16 Sep 2021, at 08:11, Ryan Blue <b...@tabular.io> wrote: >>>>>>>>> >>>>>>>>> I'd support the option that Jack suggests if we can set a few >>>>>>>>> expectations for keeping it clean. >>>>>>>>> >>>>>>>>> First, I'd like to avoid refactoring code to share it across Spark >>>>>>>>> versions -- that introduces risk because we're relying on compiling >>>>>>>>> against >>>>>>>>> one version and running in another and both Spark and Scala change >>>>>>>>> rapidly. >>>>>>>>> A big benefit of options 1 and 2 is that we mostly focus on only one >>>>>>>>> Spark >>>>>>>>> version. I think we should duplicate code rather than spend time >>>>>>>>> refactoring to rely on binary compatibility. I propose we start each >>>>>>>>> new >>>>>>>>> Spark version by copying the last one and updating it. And we should >>>>>>>>> build >>>>>>>>> just the latest supported version by default. >>>>>>>>> >>>>>>>>> The drawback to having everything in a single repo is that we >>>>>>>>> wouldn't be able to cherry-pick changes across Spark >>>>>>>>> versions/branches, but >>>>>>>>> I think Jack is right that having a single build is better. >>>>>>>>> >>>>>>>>> Second, we should make CI faster by running the Spark builds in >>>>>>>>> parallel. It sounds like this is what would happen anyway, with a >>>>>>>>> property >>>>>>>>> that selects the Spark version that you want to build against. >>>>>>>>> >>>>>>>>> Overall, this new suggestion sounds like a promising way forward. >>>>>>>>> >>>>>>>>> Ryan >>>>>>>>> >>>>>>>>> On Wed, Sep 15, 2021 at 11:46 PM Jack Ye <yezhao...@gmail.com> >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>>> I think in Ryan's proposal we will create a ton of modules >>>>>>>>>> anyway, as Wing listed we are just using git branch as an additional >>>>>>>>>> dimension, but my understanding is that you will still have 1 core, 1 >>>>>>>>>> extension, 1 runtime artifact published for each Spark version in >>>>>>>>>> either >>>>>>>>>> approach. >>>>>>>>>> >>>>>>>>>> In that case, this is just brainstorming, I wonder if we can >>>>>>>>>> explore a modified option 3 that flattens all the versions in each >>>>>>>>>> Spark >>>>>>>>>> branch in option 2 into master. 
The repository structure would look >>>>>>>>>> something like: >>>>>>>>>> >>>>>>>>>> iceberg/api/... >>>>>>>>>> /bundled-guava/... >>>>>>>>>> /core/... >>>>>>>>>> ... >>>>>>>>>> /spark/2.4/core/... >>>>>>>>>> /extension/... >>>>>>>>>> /runtime/... >>>>>>>>>> /3.1/core/... >>>>>>>>>> /extension/... >>>>>>>>>> /runtime/... >>>>>>>>>> >>>>>>>>>> The gradle build script in the root is configured to build >>>>>>>>>> against the latest version of Spark by default, unless otherwise >>>>>>>>>> specified >>>>>>>>>> by the user. >>>>>>>>>> >>>>>>>>>> Intellij can also be configured to only index files of specific >>>>>>>>>> versions based on the same config used in build. >>>>>>>>>> >>>>>>>>>> In this way, I imagine the CI setup to be much easier to do >>>>>>>>>> things like testing version compatibility for a feature or running >>>>>>>>>> only a >>>>>>>>>> specific subset of Spark version builds based on the Spark version >>>>>>>>>> directories touched. >>>>>>>>>> >>>>>>>>>> And the biggest benefit is that we don't have the same difficulty >>>>>>>>>> as option 2 of developing a feature when it's both in core and Spark. >>>>>>>>>> >>>>>>>>>> We can then develop a mechanism to vote to stop support of >>>>>>>>>> certain versions, and archive the corresponding directory to avoid >>>>>>>>>> accumulating too many versions in the long term. >>>>>>>>>> >>>>>>>>>> -Jack Ye >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On Wed, Sep 15, 2021 at 4:17 PM Ryan Blue <b...@tabular.io> >>>>>>>>>> wrote: >>>>>>>>>> >>>>>>>>>>> Sorry, I was thinking about CI integration between Iceberg Java >>>>>>>>>>> and Iceberg Spark, I just didn't mention it and I see how that's a >>>>>>>>>>> big >>>>>>>>>>> thing to leave out! >>>>>>>>>>> >>>>>>>>>>> I would definitely want to test the projects together. One thing >>>>>>>>>>> we could do is have a nightly build like Russell suggests. I'm also >>>>>>>>>>> wondering if we could have some tighter integration where the >>>>>>>>>>> Iceberg Spark >>>>>>>>>>> build can be included in the Iceberg Java build using properties. >>>>>>>>>>> Maybe the >>>>>>>>>>> github action could checkout Iceberg, then checkout the Spark >>>>>>>>>>> integration's latest branch, and then run the gradle build with a >>>>>>>>>>> property >>>>>>>>>>> that makes Spark a subproject in the build. That way we can >>>>>>>>>>> continue to >>>>>>>>>>> have Spark CI run regularly. >>>>>>>>>>> >>>>>>>>>>> On Wed, Sep 15, 2021 at 3:08 PM Russell Spitzer < >>>>>>>>>>> russell.spit...@gmail.com> wrote: >>>>>>>>>>> >>>>>>>>>>>> I agree that Option 2 is considerably more difficult for >>>>>>>>>>>> development when core API changes need to be picked up by the >>>>>>>>>>>> external >>>>>>>>>>>> Spark module. I also think a monthly release would probably still >>>>>>>>>>>> be >>>>>>>>>>>> prohibitive to actually implementing new features that appear in >>>>>>>>>>>> the API, I >>>>>>>>>>>> would hope we have a much faster process or maybe just have >>>>>>>>>>>> snapshot >>>>>>>>>>>> artifacts published nightly? >>>>>>>>>>>> >>>>>>>>>>>> On Sep 15, 2021, at 4:46 PM, Wing Yew Poon < >>>>>>>>>>>> wyp...@cloudera.com.INVALID> wrote: >>>>>>>>>>>> >>>>>>>>>>>> IIUC, Option 2 is to move the Spark support for Iceberg into a >>>>>>>>>>>> separate repo (subproject of Iceberg). Would we have branches such >>>>>>>>>>>> as >>>>>>>>>>>> 0.13-2.4, 0.13-3.0, 0.13-3.1, and 0.13-3.2? 
For features that can >>>>>>>>>>>> be >>>>>>>>>>>> supported in all versions or all Spark 3 versions, then we would >>>>>>>>>>>> need to >>>>>>>>>>>> commit the changes to all applicable branches. Basically we are >>>>>>>>>>>> trading >>>>>>>>>>>> more work to commit to multiple branches for simplified build and >>>>>>>>>>>> CI >>>>>>>>>>>> time per branch, which might be an acceptable trade-off. However, >>>>>>>>>>>> the >>>>>>>>>>>> biggest downside is that changes may need to be made in core >>>>>>>>>>>> Iceberg as >>>>>>>>>>>> well as in the engine (in this case Spark) support, and we need to >>>>>>>>>>>> wait for >>>>>>>>>>>> a release of core Iceberg to consume the changes in the >>>>>>>>>>>> subproject. In this >>>>>>>>>>>> case, maybe we should have a monthly release of core Iceberg (no >>>>>>>>>>>> matter how >>>>>>>>>>>> many changes go in, as long as it is non-zero) so that the >>>>>>>>>>>> subproject can >>>>>>>>>>>> consume changes fairly quickly? >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> On Wed, Sep 15, 2021 at 2:09 PM Ryan Blue <b...@tabular.io> >>>>>>>>>>>> wrote: >>>>>>>>>>>> >>>>>>>>>>>>> Thanks for bringing this up, Anton. I’m glad that we have the >>>>>>>>>>>>> set of potential solutions well defined. >>>>>>>>>>>>> >>>>>>>>>>>>> Looks like the next step is to decide whether we want to >>>>>>>>>>>>> require people to update Spark versions to pick up newer versions >>>>>>>>>>>>> of >>>>>>>>>>>>> Iceberg. If we choose to make people upgrade, then option 1 is >>>>>>>>>>>>> clearly the >>>>>>>>>>>>> best choice. >>>>>>>>>>>>> >>>>>>>>>>>>> I don’t think that we should make updating Spark a >>>>>>>>>>>>> requirement. Many of the things that we’re working on are >>>>>>>>>>>>> orthogonal to >>>>>>>>>>>>> Spark versions, like table maintenance actions, secondary >>>>>>>>>>>>> indexes, the 1.0 >>>>>>>>>>>>> API, views, ORC delete files, new storage implementations, etc. >>>>>>>>>>>>> Upgrading >>>>>>>>>>>>> Spark is time consuming and untrusted in my experience, so I >>>>>>>>>>>>> think we would >>>>>>>>>>>>> be setting up an unnecessary trade-off between spending lots of >>>>>>>>>>>>> time to >>>>>>>>>>>>> upgrade Spark and picking up new Iceberg features. >>>>>>>>>>>>> >>>>>>>>>>>>> Another way of thinking about this is that if we went with >>>>>>>>>>>>> option 1, then we could port bug fixes into 0.12.x. But there are >>>>>>>>>>>>> many >>>>>>>>>>>>> things that wouldn’t fit this model, like adding a FileIO >>>>>>>>>>>>> implementation >>>>>>>>>>>>> for ADLS. So some people in the community would have to maintain >>>>>>>>>>>>> branches >>>>>>>>>>>>> of newer Iceberg versions with older versions of Spark outside of >>>>>>>>>>>>> the main >>>>>>>>>>>>> Iceberg project — that defeats the purpose of simplifying things >>>>>>>>>>>>> with >>>>>>>>>>>>> option 1 because we would then have more people maintaining the >>>>>>>>>>>>> same 0.13.x >>>>>>>>>>>>> with Spark 3.1 branch. (This reminds me of the Spark community, >>>>>>>>>>>>> where we >>>>>>>>>>>>> wanted to release a 2.5 line with DSv2 backported, but the >>>>>>>>>>>>> community >>>>>>>>>>>>> decided not to so we built similar 2.4+DSv2 branches at Netflix, >>>>>>>>>>>>> Tencent, >>>>>>>>>>>>> Apple, etc.) >>>>>>>>>>>>> >>>>>>>>>>>>> If the community is going to do the work anyway — and I think >>>>>>>>>>>>> some of us would — we should make it possible to share that work. >>>>>>>>>>>>> That’s >>>>>>>>>>>>> why I don’t think that we should go with option 1. 
>>>>>>>>>>>>> >>>>>>>>>>>>> If we don’t go with option 1, then the choice is how to >>>>>>>>>>>>> maintain multiple Spark versions. I think that the way we’re >>>>>>>>>>>>> doing it right >>>>>>>>>>>>> now is not something we want to continue. >>>>>>>>>>>>> >>>>>>>>>>>>> Using multiple modules (option 3) is concerning to me because >>>>>>>>>>>>> of the changes in Spark. We currently structure the library to >>>>>>>>>>>>> share as >>>>>>>>>>>>> much code as possible. But that means compiling against different >>>>>>>>>>>>> Spark >>>>>>>>>>>>> versions and relying on binary compatibility and reflection in >>>>>>>>>>>>> some cases. >>>>>>>>>>>>> To me, this seems unmaintainable in the long run because it >>>>>>>>>>>>> requires >>>>>>>>>>>>> refactoring common classes and spending a lot of time >>>>>>>>>>>>> deduplicating code. >>>>>>>>>>>>> It also creates a ton of modules, at least one common module, >>>>>>>>>>>>> then a module >>>>>>>>>>>>> per version, then an extensions module per version, and finally a >>>>>>>>>>>>> runtime >>>>>>>>>>>>> module per version. That’s 3 modules per Spark version, plus any >>>>>>>>>>>>> new common >>>>>>>>>>>>> modules. And each module needs to be tested, which is making our >>>>>>>>>>>>> CI take a >>>>>>>>>>>>> really long time. We also don’t support multiple Scala versions, >>>>>>>>>>>>> which is >>>>>>>>>>>>> another gap that will require even more modules and tests. >>>>>>>>>>>>> >>>>>>>>>>>>> I like option 2 because it would allow us to compile against a >>>>>>>>>>>>> single version of Spark (which will be much more reliable). It >>>>>>>>>>>>> would give >>>>>>>>>>>>> us an opportunity to support different Scala versions. It avoids >>>>>>>>>>>>> the need >>>>>>>>>>>>> to refactor to share code and allows people to focus on a single >>>>>>>>>>>>> version of >>>>>>>>>>>>> Spark, while also creating a way for people to maintain and >>>>>>>>>>>>> update the >>>>>>>>>>>>> older versions with newer Iceberg releases. I don’t think that >>>>>>>>>>>>> this would >>>>>>>>>>>>> slow down development. I think it would actually speed it up >>>>>>>>>>>>> because we’d >>>>>>>>>>>>> be spending less time trying to make multiple versions work in >>>>>>>>>>>>> the same >>>>>>>>>>>>> build. And anyone in favor of option 1 would basically get option >>>>>>>>>>>>> 1: you >>>>>>>>>>>>> don’t have to care about branches for older Spark versions. >>>>>>>>>>>>> >>>>>>>>>>>>> Jack makes a good point about wanting to keep code in a single >>>>>>>>>>>>> repository, but I think that the need to manage more version >>>>>>>>>>>>> combinations >>>>>>>>>>>>> overrides this concern. It’s easier to make this decision in >>>>>>>>>>>>> python because >>>>>>>>>>>>> we’re not trying to depend on two projects that change relatively >>>>>>>>>>>>> quickly. >>>>>>>>>>>>> We’re just trying to build a library. >>>>>>>>>>>>> >>>>>>>>>>>>> Ryan >>>>>>>>>>>>> >>>>>>>>>>>>> On Wed, Sep 15, 2021 at 2:58 AM OpenInx <open...@gmail.com> >>>>>>>>>>>>> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>>> Thanks for bringing this up, Anton. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Everyone has great pros/cons to support their preferences. >>>>>>>>>>>>>> Before giving my preference, let me raise one question: >>>>>>>>>>>>>> what's the top >>>>>>>>>>>>>> priority thing for apache iceberg project at this point in time >>>>>>>>>>>>>> ? 
This >>>>>>>>>>>>>> question will help us to answer the following question: Should >>>>>>>>>>>>>> we support >>>>>>>>>>>>>> more engine versions more robustly or be a bit more aggressive >>>>>>>>>>>>>> and >>>>>>>>>>>>>> concentrate on getting the new features that users need most in >>>>>>>>>>>>>> order to >>>>>>>>>>>>>> keep the project more competitive ? >>>>>>>>>>>>>> >>>>>>>>>>>>>> If people watch the apache iceberg project and check the >>>>>>>>>>>>>> issues & PR frequently, I guess more than 90% people will >>>>>>>>>>>>>> answer the >>>>>>>>>>>>>> priority question: There is no doubt for making the whole v2 >>>>>>>>>>>>>> story to be >>>>>>>>>>>>>> production-ready. The current roadmap discussion also proofs >>>>>>>>>>>>>> the thing : >>>>>>>>>>>>>> https://lists.apache.org/x/thread.html/r84e80216c259c81f824c6971504c321cd8c785774c489d52d4fc123f@%3Cdev.iceberg.apache.org%3E >>>>>>>>>>>>>> . >>>>>>>>>>>>>> >>>>>>>>>>>>>> In order to ensure the highest priority at this point in >>>>>>>>>>>>>> time, I will prefer option-1 to reduce the cost of engine >>>>>>>>>>>>>> maintenance, so >>>>>>>>>>>>>> as to free up resources to make v2 production-ready. >>>>>>>>>>>>>> >>>>>>>>>>>>>> On Wed, Sep 15, 2021 at 3:00 PM Saisai Shao < >>>>>>>>>>>>>> sai.sai.s...@gmail.com> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>>> From Dev's point, it has less burden to always support the >>>>>>>>>>>>>>> latest version of Spark (for example). But from user's point, >>>>>>>>>>>>>>> especially for us who maintain Spark internally, it is not easy >>>>>>>>>>>>>>> to upgrade >>>>>>>>>>>>>>> the Spark version for the first time (since we have many >>>>>>>>>>>>>>> customizations >>>>>>>>>>>>>>> internally), and we're still promoting to upgrade to 3.1.2. If >>>>>>>>>>>>>>> the >>>>>>>>>>>>>>> community ditches the support of old version of Spark3, users >>>>>>>>>>>>>>> have to >>>>>>>>>>>>>>> maintain it themselves unavoidably. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> So I'm inclined to make this support in community, not by >>>>>>>>>>>>>>> users themselves, as for Option 2 or 3, I'm fine with either. >>>>>>>>>>>>>>> And to >>>>>>>>>>>>>>> relieve the burden, we could support limited versions of Spark >>>>>>>>>>>>>>> (for example >>>>>>>>>>>>>>> 2 versions). >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Just my two cents. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> -Saisai >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Jack Ye <yezhao...@gmail.com> 于2021年9月15日周三 下午1:35写道: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Hi Wing Yew, >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> I think 2.4 is a different story, we will continue to >>>>>>>>>>>>>>>> support Spark 2.4, but as you can see it will continue to have >>>>>>>>>>>>>>>> very limited >>>>>>>>>>>>>>>> functionalities comparing to Spark 3. I believe we discussed >>>>>>>>>>>>>>>> about option 3 >>>>>>>>>>>>>>>> when we were doing Spark 3.0 to 3.1 upgrade. Recently we are >>>>>>>>>>>>>>>> seeing the >>>>>>>>>>>>>>>> same issue for Flink 1.11, 1.12 and 1.13 as well. I feel we >>>>>>>>>>>>>>>> need a >>>>>>>>>>>>>>>> consistent strategy around this, let's take this chance to >>>>>>>>>>>>>>>> make a good >>>>>>>>>>>>>>>> community guideline for all future engine versions, especially >>>>>>>>>>>>>>>> for Spark, >>>>>>>>>>>>>>>> Flink and Hive that are in the same repository. 
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> I can totally understand your point of view Wing, in fact, >>>>>>>>>>>>>>>> speaking from the perspective of AWS EMR, we have to support >>>>>>>>>>>>>>>> over 40 >>>>>>>>>>>>>>>> versions of the software because there are people who are >>>>>>>>>>>>>>>> still using Spark >>>>>>>>>>>>>>>> 1.4, believe it or not. After all, keep backporting changes >>>>>>>>>>>>>>>> will become a >>>>>>>>>>>>>>>> liability not only on the user side, but also on the service >>>>>>>>>>>>>>>> provider side, >>>>>>>>>>>>>>>> so I believe it's not a bad practice to push for user upgrade, >>>>>>>>>>>>>>>> as it will >>>>>>>>>>>>>>>> make the life of both parties easier in the end. New feature >>>>>>>>>>>>>>>> is definitely >>>>>>>>>>>>>>>> one of the best incentives to promote an upgrade on user side. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> I think the biggest issue of option 3 is about its >>>>>>>>>>>>>>>> scalability, because we will have an unbounded list of >>>>>>>>>>>>>>>> packages to add and >>>>>>>>>>>>>>>> compile in the future, and we probably cannot drop support of >>>>>>>>>>>>>>>> that package >>>>>>>>>>>>>>>> once created. If we go with option 1, I think we can still >>>>>>>>>>>>>>>> publish a few >>>>>>>>>>>>>>>> patch versions for old Iceberg releases, and committers can >>>>>>>>>>>>>>>> control the >>>>>>>>>>>>>>>> amount of patch versions to guard people from abusing the >>>>>>>>>>>>>>>> power of >>>>>>>>>>>>>>>> patching. I see this as a consistent strategy also for Flink >>>>>>>>>>>>>>>> and Hive. With >>>>>>>>>>>>>>>> this strategy, we can truly have a compatibility matrix for >>>>>>>>>>>>>>>> engine versions >>>>>>>>>>>>>>>> against Iceberg versions. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> -Jack >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> On Tue, Sep 14, 2021 at 10:00 PM Wing Yew Poon < >>>>>>>>>>>>>>>> wyp...@cloudera.com.invalid> wrote: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> I understand and sympathize with the desire to use new >>>>>>>>>>>>>>>>> DSv2 features in Spark 3.2. I agree that Option 1 is the >>>>>>>>>>>>>>>>> easiest for >>>>>>>>>>>>>>>>> developers, but I don't think it considers the interests of >>>>>>>>>>>>>>>>> users. I do not >>>>>>>>>>>>>>>>> think that most users will upgrade to Spark 3.2 as soon as it >>>>>>>>>>>>>>>>> is released. >>>>>>>>>>>>>>>>> It is a "minor version" upgrade in name from 3.1 (or from >>>>>>>>>>>>>>>>> 3.0), but I think >>>>>>>>>>>>>>>>> we all know that it is not a minor upgrade. There are a lot >>>>>>>>>>>>>>>>> of changes from >>>>>>>>>>>>>>>>> 3.0 to 3.1 and from 3.1 to 3.2. I think there are even a lot >>>>>>>>>>>>>>>>> of users >>>>>>>>>>>>>>>>> running Spark 2.4 and not even on Spark 3 yet. Do we also >>>>>>>>>>>>>>>>> plan to stop >>>>>>>>>>>>>>>>> supporting Spark 2.4? >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Please correct me if I'm mistaken, but the folks who have >>>>>>>>>>>>>>>>> spoken out in favor of Option 1 all work for the same >>>>>>>>>>>>>>>>> organization, don't >>>>>>>>>>>>>>>>> they? And they don't have a problem with making their users, >>>>>>>>>>>>>>>>> all internal, >>>>>>>>>>>>>>>>> simply upgrade to Spark 3.2, do they? (Or they are already >>>>>>>>>>>>>>>>> running an >>>>>>>>>>>>>>>>> internal fork that is close to 3.2.) >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> I work for an organization with customers running >>>>>>>>>>>>>>>>> different versions of Spark. It is true that we can backport >>>>>>>>>>>>>>>>> new features >>>>>>>>>>>>>>>>> to older versions if we wanted to. 
I suppose the people >>>>>>>>>>>>>>>>> contributing to >>>>>>>>>>>>>>>>> Iceberg work for some organization or other that either use >>>>>>>>>>>>>>>>> Iceberg >>>>>>>>>>>>>>>>> in-house, or provide software (possibly in the form of a >>>>>>>>>>>>>>>>> service) to >>>>>>>>>>>>>>>>> customers, and either way, the organizations have the ability >>>>>>>>>>>>>>>>> to backport >>>>>>>>>>>>>>>>> features and fixes to internal versions. Are there any users >>>>>>>>>>>>>>>>> out there who >>>>>>>>>>>>>>>>> simply use Apache Iceberg and depend on the community version? >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> There may be features that are broadly useful that do not >>>>>>>>>>>>>>>>> depend on Spark 3.2. Is it worth supporting them on Spark >>>>>>>>>>>>>>>>> 3.0/3.1 (and even >>>>>>>>>>>>>>>>> 2.4)? >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> I am not in favor of Option 2. I do not oppose Option 1, >>>>>>>>>>>>>>>>> but I would consider Option 3 too. Anton, you said 5 modules >>>>>>>>>>>>>>>>> are required; >>>>>>>>>>>>>>>>> what are the modules you're thinking of? >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> - Wing Yew >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> On Tue, Sep 14, 2021 at 5:38 PM Yufei Gu < >>>>>>>>>>>>>>>>> flyrain...@gmail.com> wrote: >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Option 1 sounds good to me. Here are my reasons: >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> 1. Both 2 and 3 will slow down the development. >>>>>>>>>>>>>>>>>> Considering the limited resources in the open source >>>>>>>>>>>>>>>>>> community, the upsides >>>>>>>>>>>>>>>>>> of option 2 and 3 are probably not worthy. >>>>>>>>>>>>>>>>>> 2. Both 2 and 3 assume the use cases may not exist. It's >>>>>>>>>>>>>>>>>> hard to predict anything, but even if these use cases are >>>>>>>>>>>>>>>>>> legit, users can >>>>>>>>>>>>>>>>>> still get the new feature by backporting it to an older >>>>>>>>>>>>>>>>>> version in case of >>>>>>>>>>>>>>>>>> upgrading to a newer version isn't an option. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Best, >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Yufei >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> `This is not a contribution` >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> On Tue, Sep 14, 2021 at 4:54 PM Anton Okolnychyi < >>>>>>>>>>>>>>>>>> aokolnyc...@apple.com.invalid> wrote: >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> To sum up what we have so far: >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> *Option 1 (support just the most recent minor Spark 3 >>>>>>>>>>>>>>>>>>> version)* >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> The easiest option for us devs, forces the user to >>>>>>>>>>>>>>>>>>> upgrade to the most recent minor Spark version to consume >>>>>>>>>>>>>>>>>>> any new >>>>>>>>>>>>>>>>>>> Iceberg features. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> *Option 2 (a separate project under Iceberg)* >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Can support as many Spark versions as needed and the >>>>>>>>>>>>>>>>>>> codebase is still separate as we can use separate branches. >>>>>>>>>>>>>>>>>>> Impossible to consume any unreleased changes in core, >>>>>>>>>>>>>>>>>>> may slow down the development. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> *Option 3 (separate modules for Spark 3.1/3.2)* >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Introduce more modules in the same project. 
>>>>>>>>>>>>>>>>>>> Can consume unreleased changes, but it will require at least 5 modules to support 2.4, 3.1 and 3.2, making the build and testing complicated. >>>>>>>>>>>>>>>>>>> Are there any users for whom upgrading the minor Spark version (e.g. 3.1 to 3.2) to consume new features is a blocker? We follow Option 1 internally at the moment but I would like to hear what other people think/need. >>>>>>>>>>>>>>>>>>> - Anton >>>>>>>>>>>>>>>>>>> On 14 Sep 2021, at 09:44, Russell Spitzer <russell.spit...@gmail.com> wrote: >>>>>>>>>>>>>>>>>>> I think we should go for option 1. I already am not a big fan of having runtime errors for unsupported things based on versions and I don't think minor version upgrades are a large issue for users. I'm especially not looking forward to supporting interfaces that only exist in Spark 3.2 in a multiple Spark version support future. >>>>>>>>>>>>>>>>>>> On Sep 14, 2021, at 11:32 AM, Anton Okolnychyi <aokolnyc...@apple.com.INVALID> wrote: >>>>>>>>>>>>>>>>>>> First of all, is option 2 a viable option? We discussed separating the python module outside of the project a few weeks ago, and decided to not do that because it's beneficial for code cross reference and more intuitive for new developers to see everything in the same repository. I would expect the same argument to also hold here. >>>>>>>>>>>>>>>>>>> That's exactly the concern I have about Option 2 at this moment. >>>>>>>>>>>>>>>>>>> Overall I would personally prefer us to not support all the minor versions, but instead support maybe just 2-3 latest versions in a major version. >>>>>>>>>>>>>>>>>>> This is when it gets a bit complicated. If we want to support both Spark 3.1 and Spark 3.2 with a single module, it means we have to compile against 3.1. The problem is that we rely on DSv2, which is being actively developed. 3.2 and 3.1 have substantial differences. On top of that, we have our extensions that are extremely low-level and may break not only between minor versions but also between patch releases. >>>>>>>>>>>>>>>>>>> If there are some features requiring a newer version, it makes sense to move that newer version in master. >>>>>>>>>>>>>>>>>>> Internally, we don't deliver new features to older Spark versions as it requires a lot of effort to port things. Personally, I don't think it is too bad to require users to upgrade if they want new features.
>>>>>>>>>>>>>>>>>>> At the same time, there are valid concerns with this >>>>>>>>>>>>>>>>>>> approach too that we >>>>>>>>>>>>>>>>>>> mentioned during the sync. For example, certain new >>>>>>>>>>>>>>>>>>> features would also >>>>>>>>>>>>>>>>>>> work fine with older Spark versions. I generally agree with >>>>>>>>>>>>>>>>>>> that and that >>>>>>>>>>>>>>>>>>> not supporting recent versions is not ideal. However, I >>>>>>>>>>>>>>>>>>> want to find a >>>>>>>>>>>>>>>>>>> balance between the complexity on our side and ease of use >>>>>>>>>>>>>>>>>>> for the users. >>>>>>>>>>>>>>>>>>> Ideally, supporting a few recent versions would be >>>>>>>>>>>>>>>>>>> sufficient but our Spark >>>>>>>>>>>>>>>>>>> integration is too low-level to do that with a single >>>>>>>>>>>>>>>>>>> module. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> On 13 Sep 2021, at 20:53, Jack Ye <yezhao...@gmail.com> >>>>>>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> First of all, is option 2 a viable option? We discussed >>>>>>>>>>>>>>>>>>> separating the python module outside of the project a few >>>>>>>>>>>>>>>>>>> weeks ago, and >>>>>>>>>>>>>>>>>>> decided to not do that because it's beneficial for code >>>>>>>>>>>>>>>>>>> cross reference and >>>>>>>>>>>>>>>>>>> more intuitive for new developers to see everything in the >>>>>>>>>>>>>>>>>>> same repository. >>>>>>>>>>>>>>>>>>> I would expect the same argument to also hold here. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Overall I would personally prefer us to not support all >>>>>>>>>>>>>>>>>>> the minor versions, but instead support maybe just 2-3 >>>>>>>>>>>>>>>>>>> latest versions in a >>>>>>>>>>>>>>>>>>> major version. This avoids the problem that some users are >>>>>>>>>>>>>>>>>>> unwilling to >>>>>>>>>>>>>>>>>>> move to a newer version and keep patching old Spark version >>>>>>>>>>>>>>>>>>> branches. If >>>>>>>>>>>>>>>>>>> there are some features requiring a newer version, it makes >>>>>>>>>>>>>>>>>>> sense to move >>>>>>>>>>>>>>>>>>> that newer version in master. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> In addition, because currently Spark is considered the >>>>>>>>>>>>>>>>>>> most feature-complete reference implementation compared to >>>>>>>>>>>>>>>>>>> all other >>>>>>>>>>>>>>>>>>> engines, I think we should not add artificial barriers that >>>>>>>>>>>>>>>>>>> would slow down >>>>>>>>>>>>>>>>>>> its development speed. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> So my thinking is closer to option 1. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Best, >>>>>>>>>>>>>>>>>>> Jack Ye >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> On Mon, Sep 13, 2021 at 7:39 PM Anton Okolnychyi < >>>>>>>>>>>>>>>>>>> aokolnyc...@apple.com.invalid> wrote: >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Hey folks, >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> I want to discuss our Spark version support strategy. >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> So far, we have tried to support both 3.0 and 3.1. It >>>>>>>>>>>>>>>>>>>> is great to support older versions but because we compile >>>>>>>>>>>>>>>>>>>> against 3.0, we >>>>>>>>>>>>>>>>>>>> cannot use any Spark features that are offered in newer >>>>>>>>>>>>>>>>>>>> versions. >>>>>>>>>>>>>>>>>>>> Spark 3.2 is just around the corner and it brings a lot >>>>>>>>>>>>>>>>>>>> of important features such dynamic filtering for v2 >>>>>>>>>>>>>>>>>>>> tables, required >>>>>>>>>>>>>>>>>>>> distribution and ordering for writes, etc. These features >>>>>>>>>>>>>>>>>>>> are too important >>>>>>>>>>>>>>>>>>>> to ignore them. 
>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Apart from that, I have an end-to-end prototype for >>>>>>>>>>>>>>>>>>>> merge-on-read with Spark that actually leverages some of >>>>>>>>>>>>>>>>>>>> the 3.2 features. >>>>>>>>>>>>>>>>>>>> I’ll be implementing all new Spark DSv2 APIs for us >>>>>>>>>>>>>>>>>>>> internally and would >>>>>>>>>>>>>>>>>>>> love to share that with the rest of the community. >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> I see two options to move forward: >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Option 1 >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Migrate to Spark 3.2 in master, maintain 0.12 for a >>>>>>>>>>>>>>>>>>>> while by releasing minor versions with bug fixes. >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Pros: almost no changes to the build configuration, no >>>>>>>>>>>>>>>>>>>> extra work on our side as just a single Spark version is >>>>>>>>>>>>>>>>>>>> actively >>>>>>>>>>>>>>>>>>>> maintained. >>>>>>>>>>>>>>>>>>>> Cons: some new features that we will be adding to >>>>>>>>>>>>>>>>>>>> master could also work with older Spark versions but all >>>>>>>>>>>>>>>>>>>> 0.12 releases will >>>>>>>>>>>>>>>>>>>> only contain bug fixes. Therefore, users will be forced to >>>>>>>>>>>>>>>>>>>> migrate to Spark >>>>>>>>>>>>>>>>>>>> 3.2 to consume any new Spark or format features. >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Option 2 >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Move our Spark integration into a separate project and >>>>>>>>>>>>>>>>>>>> introduce branches for 3.0, 3.1 and 3.2. >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Pros: decouples the format version from Spark, we can >>>>>>>>>>>>>>>>>>>> support as many Spark versions as needed. >>>>>>>>>>>>>>>>>>>> Cons: more work initially to set everything up, more >>>>>>>>>>>>>>>>>>>> work to release, will need a new release of the core >>>>>>>>>>>>>>>>>>>> format to consume any >>>>>>>>>>>>>>>>>>>> changes in the Spark integration. >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Overall, I think option 2 seems better for the user but >>>>>>>>>>>>>>>>>>>> my main worry is that we will have to release the format >>>>>>>>>>>>>>>>>>>> more frequently >>>>>>>>>>>>>>>>>>>> (which is a good thing but requires more work and time) >>>>>>>>>>>>>>>>>>>> and the overall >>>>>>>>>>>>>>>>>>>> Spark development may be slower. >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> I’d love to hear what everybody thinks about this >>>>>>>>>>>>>>>>>>>> matter. >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>>>>>>>> Anton >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> -- >>>>>>>>>>>>> Ryan Blue >>>>>>>>>>>>> Tabular >>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> -- >>>>>>>>>>> Ryan Blue >>>>>>>>>>> Tabular >>>>>>>>>>> >>>>>>>>>> >>>>>>>>> >>>>>>>>> -- >>>>>>>>> Ryan Blue >>>>>>>>> Tabular >>>>>>>>> >>>>>>>>> >>>>>>>>> >> >> -- >> Ryan Blue >> Tabular >> >