Re: [DISCUSS] Spark version support strategy

Wing Yew Poon Wed, 15 Sep 2021 14:47:01 -0700

IIUC, Option 2 is to move the Spark support for Iceberg into a separate
repo (subproject of Iceberg). Would we have branches such as 0.13-2.4,
0.13-3.0, 0.13-3.1, and 0.13-3.2? For features that can be supported in all
versions or all Spark 3 versions, we would need to commit the changes
to all applicable branches. Basically we are trading more work to commit to
multiple branches for simplified build and CI time per branch, which might
be an acceptable trade-off. However, the biggest downside is that changes
may need to be made in core Iceberg as well as in the engine (in this case
Spark) support, and we need to wait for a release of core Iceberg to
consume the changes in the subproject. In this case, maybe we should have a
monthly release of core Iceberg (no matter how many changes go in, as long
as it is non-zero) so that the subproject can consume changes fairly
quickly?



On Wed, Sep 15, 2021 at 2:09 PM Ryan Blue <b...@tabular.io> wrote:

> Thanks for bringing this up, Anton. I’m glad that we have the set of
> potential solutions well defined.
>
> Looks like the next step is to decide whether we want to require people to
> update Spark versions to pick up newer versions of Iceberg. If we choose to
> make people upgrade, then option 1 is clearly the best choice.
>
> I don’t think that we should make updating Spark a requirement. Many of
> the things that we’re working on are orthogonal to Spark versions, like
> table maintenance actions, secondary indexes, the 1.0 API, views, ORC
> delete files, new storage implementations, etc. Upgrading Spark is time
> consuming and untrusted in my experience, so I think we would be setting up
> an unnecessary trade-off between spending lots of time to upgrade Spark and
> picking up new Iceberg features.
>
> Another way of thinking about this is that if we went with option 1, then
> we could port bug fixes into 0.12.x. But there are many things that
> wouldn’t fit this model, like adding a FileIO implementation for ADLS. So
> some people in the community would have to maintain branches of newer
> Iceberg versions with older versions of Spark outside of the main Iceberg
> project — that defeats the purpose of simplifying things with option 1
> because we would then have more people maintaining the same 0.13.x with
> Spark 3.1 branch. (This reminds me of the Spark community, where we wanted
> to release a 2.5 line with DSv2 backported, but the community decided not
> to so we built similar 2.4+DSv2 branches at Netflix, Tencent, Apple, etc.)
>
> If the community is going to do the work anyway — and I think some of us
> would — we should make it possible to share that work. That’s why I don’t
> think that we should go with option 1.
>
> If we don’t go with option 1, then the choice is how to maintain multiple
> Spark versions. I think that the way we’re doing it right now is not
> something we want to continue.
>
> Using multiple modules (option 3) is concerning to me because of the
> changes in Spark. We currently structure the library to share as much code
> as possible. But that means compiling against different Spark versions and
> relying on binary compatibility and reflection in some cases. To me, this
> seems unmaintainable in the long run because it requires refactoring common
> classes and spending a lot of time deduplicating code. It also creates a
> ton of modules, at least one common module, then a module per version, then
> an extensions module per version, and finally a runtime module per version.
> That’s 3 modules per Spark version, plus any new common modules. And each
> module needs to be tested, which is making our CI take a really long time.
> We also don’t support multiple Scala versions, which is another gap that
> will require even more modules and tests.
>
> I like option 2 because it would allow us to compile against a single
> version of Spark (which will be much more reliable). It would give us an
> opportunity to support different Scala versions. It avoids the need to
> refactor to share code and allows people to focus on a single version of
> Spark, while also creating a way for people to maintain and update the
> older versions with newer Iceberg releases. I don’t think that this would
> slow down development. I think it would actually speed it up because we’d
> be spending less time trying to make multiple versions work in the same
> build. And anyone in favor of option 1 would basically get option 1: you
> don’t have to care about branches for older Spark versions.
>
> Jack makes a good point about wanting to keep code in a single repository,
> but I think that the need to manage more version combinations overrides
> this concern. It’s easier to make this decision in Python because we’re not
> trying to depend on two projects that change relatively quickly. We’re just
> trying to build a library.
>
> Ryan
>
> On Wed, Sep 15, 2021 at 2:58 AM OpenInx <open...@gmail.com> wrote:
>
>> Thanks for bringing this up, Anton.
>>
>> Everyone has given good pros and cons to support their preferences. Before
>> giving my preference, let me raise one question: what is the top priority
>> for the Apache Iceberg project at this point in time? The answer will help
>> us decide the following: should we support more engine versions
>> more robustly, or be a bit more aggressive and concentrate on getting the
>> new features that users need most in order to keep the project more
>> competitive?
>>
>> If people watch the Apache Iceberg project and check the issues and
>> PRs frequently, I guess more than 90% of them would answer the priority
>> question the same way: without doubt, it is making the whole v2 story
>> production-ready. The current roadmap discussion also proves this:
>> https://lists.apache.org/x/thread.html/r84e80216c259c81f824c6971504c321cd8c785774c489d52d4fc123f@%3Cdev.iceberg.apache.org%3E
>>
>> To protect that top priority at this point in time, I prefer
>> option 1, which reduces the cost of engine maintenance and so frees up
>> resources to make v2 production-ready.
>>
>> On Wed, Sep 15, 2021 at 3:00 PM Saisai Shao <sai.sai.s...@gmail.com>
>> wrote:
>>
>>> From a developer's point of view, it is less of a burden to always support
>>> only the latest version of Spark (for example). But from a user's point of
>>> view, especially for those of us who maintain Spark internally, it is not
>>> easy to upgrade the Spark version in the first place (since we have many
>>> internal customizations), and we are still pushing to upgrade to 3.1.2. If
>>> the community drops support for older versions of Spark 3, users will
>>> unavoidably have to maintain it themselves.
>>>
>>> So I'm inclined to keep this support in the community rather than leaving
>>> it to users themselves. As for Option 2 or 3, I'm fine with either. And to
>>> relieve the burden, we could support a limited number of Spark versions
>>> (for example, 2 versions).
>>>
>>> Just my two cents.
>>>
>>> -Saisai
>>>
>>>
>>> On Wed, Sep 15, 2021 at 1:35 PM Jack Ye <yezhao...@gmail.com> wrote:
>>>
>>>> Hi Wing Yew,
>>>>
>>>> I think 2.4 is a different story. We will continue to support Spark
>>>> 2.4, but as you can see it will continue to have very limited
>>>> functionality compared to Spark 3. I believe we discussed option 3
>>>> when we were doing the Spark 3.0 to 3.1 upgrade. Recently we have been
>>>> seeing the same issue for Flink 1.11, 1.12 and 1.13 as well. I feel we
>>>> need a consistent strategy around this. Let's take this chance to make a
>>>> good community guideline for all future engine versions, especially for
>>>> Spark, Flink and Hive, which are in the same repository.
>>>>
>>>> I can totally understand your point of view, Wing. In fact, speaking
>>>> from the perspective of AWS EMR, we have to support over 40 versions of the
>>>> software because there are people who are still using Spark 1.4, believe it
>>>> or not. Ultimately, continually backporting changes will become a liability
>>>> not only on the user side, but also on the service provider side, so I
>>>> believe it's not a bad practice to push for user upgrades, as it will make
>>>> life easier for both parties in the end. New features are definitely one of
>>>> the best incentives to promote an upgrade on the user side.
>>>>
>>>> I think the biggest issue with option 3 is its scalability, because
>>>> we will have an unbounded list of packages to add and compile in the
>>>> future, and we probably cannot drop support for a package once it is
>>>> created. If we go with option 1, I think we can still publish a few patch
>>>> versions for old Iceberg releases, and committers can control the number of
>>>> patch versions to guard against people abusing the power of patching. I see
>>>> this as a consistent strategy for Flink and Hive as well. With this
>>>> strategy, we can truly have a compatibility matrix for engine versions
>>>> against Iceberg versions.
>>>>
>>>> -Jack
>>>>
>>>>
>>>>
>>>> On Tue, Sep 14, 2021 at 10:00 PM Wing Yew Poon
>>>> <wyp...@cloudera.com.invalid> wrote:
>>>>
>>>>> I understand and sympathize with the desire to use new DSv2 features
>>>>> in Spark 3.2. I agree that Option 1 is the easiest for developers, but I
>>>>> don't think it considers the interests of users. I do not think that most
>>>>> users will upgrade to Spark 3.2 as soon as it is released. It is a "minor
>>>>> version" upgrade in name from 3.1 (or from 3.0), but I think we all know
>>>>> that it is not a minor upgrade. There are a lot of changes from 3.0 to 3.1
>>>>> and from 3.1 to 3.2. I think there are even a lot of users running Spark
>>>>> 2.4 and not even on Spark 3 yet. Do we also plan to stop supporting Spark
>>>>> 2.4?
>>>>>
>>>>> Please correct me if I'm mistaken, but the folks who have spoken out
>>>>> in favor of Option 1 all work for the same organization, don't they? And
>>>>> they don't have a problem with making their users, all internal, simply
>>>>> upgrade to Spark 3.2, do they? (Or they are already running an internal
>>>>> fork that is close to 3.2.)
>>>>>
>>>>> I work for an organization with customers running different versions
>>>>> of Spark. It is true that we could backport new features to older versions
>>>>> if we wanted to. I suppose the people contributing to Iceberg work for some
>>>>> organization or other that either uses Iceberg in-house or provides
>>>>> software (possibly in the form of a service) to customers, and either way,
>>>>> the organizations have the ability to backport features and fixes to
>>>>> internal versions. Are there any users out there who simply use Apache
>>>>> Iceberg and depend on the community version?
>>>>>
>>>>> There may be features that are broadly useful that do not depend on
>>>>> Spark 3.2. Is it worth supporting them on Spark 3.0/3.1 (and even 2.4)?
>>>>>
>>>>> I am not in favor of Option 2. I do not oppose Option 1, but I would
>>>>> consider Option 3 too. Anton, you said 5 modules are required; what are
>>>>> the modules you're thinking of?
>>>>>
>>>>> - Wing Yew
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Sep 14, 2021 at 5:38 PM Yufei Gu <flyrain...@gmail.com> wrote:
>>>>>
>>>>>> Option 1 sounds good to me. Here are my reasons:
>>>>>>
>>>>>> 1. Both 2 and 3 will slow down the development. Considering the
>>>>>> limited resources in the open source community, the upsides of options 2
>>>>>> and 3 are probably not worth it.
>>>>>> 2. Both 2 and 3 assume use cases that may not exist. It's hard to
>>>>>> predict anything, but even if these use cases are legit, users can still
>>>>>> get the new feature by backporting it to an older version in case
>>>>>> upgrading to a newer version isn't an option.
>>>>>>
>>>>>> Best,
>>>>>>
>>>>>> Yufei
>>>>>>
>>>>>> `This is not a contribution`
>>>>>>
>>>>>>
>>>>>> On Tue, Sep 14, 2021 at 4:54 PM Anton Okolnychyi
>>>>>> <aokolnyc...@apple.com.invalid> wrote:
>>>>>>
>>>>>>> To sum up what we have so far:
>>>>>>>
>>>>>>>
>>>>>>> *Option 1 (support just the most recent minor Spark 3 version)*
>>>>>>>
>>>>>>> The easiest option for us devs, forces the user to upgrade to the
>>>>>>> most recent minor Spark version to consume any new Iceberg features.
>>>>>>>
>>>>>>> *Option 2 (a separate project under Iceberg)*
>>>>>>>
>>>>>>> Can support as many Spark versions as needed and the codebase is
>>>>>>> still separate as we can use separate branches.
>>>>>>> Impossible to consume any unreleased changes in core, may slow down
>>>>>>> the development.
>>>>>>>
>>>>>>> *Option 3 (separate modules for Spark 3.1/3.2)*
>>>>>>>
>>>>>>> Introduce more modules in the same project.
>>>>>>> Can consume unreleased changes but it will require at least 5
>>>>>>> modules to support 2.4, 3.1 and 3.2, making the build and testing
>>>>>>> complicated.
>>>>>>>
>>>>>>>
>>>>>>> Are there any users for whom upgrading the minor Spark version (e.g. 3.1
>>>>>>> to 3.2) to consume new features is a blocker?
>>>>>>> We follow Option 1 internally at the moment but I would like to hear
>>>>>>> what other people think/need.
>>>>>>>
>>>>>>> - Anton
>>>>>>>
>>>>>>>
>>>>>>> On 14 Sep 2021, at 09:44, Russell Spitzer <russell.spit...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>> I think we should go for option 1. I already am not a big fan of
>>>>>>> having runtime errors for unsupported things based on versions and I
>>>>>>> don't think minor version upgrades are a large issue for users. I'm
>>>>>>> especially not looking forward to supporting interfaces that only exist
>>>>>>> in Spark 3.2 in a multiple Spark version support future.
>>>>>>>
>>>>>>> On Sep 14, 2021, at 11:32 AM, Anton Okolnychyi <
>>>>>>> aokolnyc...@apple.com.INVALID> wrote:
>>>>>>>
>>>>>>> First of all, is option 2 a viable option? We discussed separating
>>>>>>> the python module outside of the project a few weeks ago, and decided to
>>>>>>> not do that because it's beneficial for code cross reference and more
>>>>>>> intuitive for new developers to see everything in the same repository. I
>>>>>>> would expect the same argument to also hold here.
>>>>>>>
>>>>>>>
>>>>>>> That’s exactly the concern I have about Option 2 at this moment.
>>>>>>>
>>>>>>> Overall I would personally prefer us to not support all the minor
>>>>>>> versions, but instead support maybe just 2-3 latest versions in a major
>>>>>>> version.
>>>>>>>
>>>>>>>
>>>>>>> This is when it gets a bit complicated. If we want to support both
>>>>>>> Spark 3.1 and Spark 3.2 with a single module, it means we have to
>>>>>>> compile against 3.1. The problem is that we rely on DSv2 that is being actively
>>>>>>> developed. 3.2 and 3.1 have substantial differences. On top of that, we
>>>>>>> have our extensions that are extremely low-level and may break not only
>>>>>>> between minor versions but also between patch releases.
>>>>>>>
>>>>>>> If there are some features requiring a newer version, it makes sense
>>>>>>> to move that newer version in master.
>>>>>>>
>>>>>>>
>>>>>>> Internally, we don’t deliver new features to older Spark versions as
>>>>>>> it requires a lot of effort to port things. Personally, I don’t think
>>>>>>> it is too bad to require users to upgrade if they want new features. At
>>>>>>> the same time, there are valid concerns with this approach too that we
>>>>>>> mentioned during the sync. For example, certain new features would also
>>>>>>> work fine with older Spark versions. I generally agree with that and that
>>>>>>> not supporting recent versions is not ideal. However, I want to find a
>>>>>>> balance between the complexity on our side and ease of use for the users.
>>>>>>> Ideally, supporting a few recent versions would be sufficient but our
>>>>>>> Spark integration is too low-level to do that with a single module.
>>>>>>>
>>>>>>>
>>>>>>> On 13 Sep 2021, at 20:53, Jack Ye <yezhao...@gmail.com> wrote:
>>>>>>>
>>>>>>> First of all, is option 2 a viable option? We discussed separating
>>>>>>> the python module outside of the project a few weeks ago, and decided to
>>>>>>> not do that because it's beneficial for code cross reference and more
>>>>>>> intuitive for new developers to see everything in the same repository. I
>>>>>>> would expect the same argument to also hold here.
>>>>>>>
>>>>>>> Overall I would personally prefer us to not support all the minor
>>>>>>> versions, but instead support maybe just 2-3 latest versions in a major
>>>>>>> version. This avoids the problem that some users are unwilling to move
>>>>>>> to a newer version and keep patching old Spark version branches. If there
>>>>>>> are some features requiring a newer version, it makes sense to move that
>>>>>>> newer version in master.
>>>>>>>
>>>>>>> In addition, because currently Spark is considered the most
>>>>>>> feature-complete reference implementation compared to all other
>>>>>>> engines, I think we should not add artificial barriers that would slow
>>>>>>> down its development speed.
>>>>>>>
>>>>>>> So my thinking is closer to option 1.
>>>>>>>
>>>>>>> Best,
>>>>>>> Jack Ye
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Sep 13, 2021 at 7:39 PM Anton Okolnychyi <
>>>>>>> aokolnyc...@apple.com.invalid> wrote:
>>>>>>>
>>>>>>>> Hey folks,
>>>>>>>>
>>>>>>>> I want to discuss our Spark version support strategy.
>>>>>>>>
>>>>>>>> So far, we have tried to support both 3.0 and 3.1. It is great to
>>>>>>>> support older versions but because we compile against 3.0, we cannot
>>>>>>>> use any Spark features that are offered in newer versions.
>>>>>>>> Spark 3.2 is just around the corner and it brings a lot of
>>>>>>>> important features such as dynamic filtering for v2 tables, required
>>>>>>>> distribution and ordering for writes, etc. These features are too
>>>>>>>> important to ignore.
>>>>>>>>
>>>>>>>> Apart from that, I have an end-to-end prototype for merge-on-read
>>>>>>>> with Spark that actually leverages some of the 3.2 features. I’ll be
>>>>>>>> implementing all new Spark DSv2 APIs for us internally and would love
>>>>>>>> to share that with the rest of the community.
>>>>>>>>
>>>>>>>> I see two options to move forward:
>>>>>>>>
>>>>>>>> Option 1
>>>>>>>>
>>>>>>>> Migrate to Spark 3.2 in master, maintain 0.12 for a while by
>>>>>>>> releasing minor versions with bug fixes.
>>>>>>>>
>>>>>>>> Pros: almost no changes to the build configuration, no extra work
>>>>>>>> on our side as just a single Spark version is actively maintained.
>>>>>>>> Cons: some new features that we will be adding to master could also
>>>>>>>> work with older Spark versions but all 0.12 releases will only contain
>>>>>>>> bug fixes. Therefore, users will be forced to migrate to Spark 3.2 to
>>>>>>>> consume any new Spark or format features.
>>>>>>>>
>>>>>>>> Option 2
>>>>>>>>
>>>>>>>> Move our Spark integration into a separate project and introduce
>>>>>>>> branches for 3.0, 3.1 and 3.2.
>>>>>>>>
>>>>>>>> Pros: decouples the format version from Spark, we can support as
>>>>>>>> many Spark versions as needed.
>>>>>>>> Cons: more work initially to set everything up, more work to
>>>>>>>> release, will need a new release of the core format to consume any
>>>>>>>> changes in the Spark integration.
>>>>>>>>
>>>>>>>> Overall, I think option 2 seems better for the user but my main
>>>>>>>> worry is that we will have to release the format more frequently
>>>>>>>> (which is a good thing but requires more work and time) and the overall
>>>>>>>> Spark development may be slower.
>>>>>>>>
>>>>>>>> I’d love to hear what everybody thinks about this matter.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Anton
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>
> --
> Ryan Blue
> Tabular
>
