Thanks for bringing this up, Anton. Everyone has laid out good pros and cons to support their preferences. Before giving my own preference, let me raise one question: what is the top priority for the Apache Iceberg project at this point in time? Answering that will help us answer the follow-up question: should we support more engine versions more robustly, or be a bit more aggressive and concentrate on delivering the new features that users need most, to keep the project competitive?
If people watch the Apache Iceberg project and check its issues and PRs frequently, I guess more than 90% of them will give the same answer to the priority question: there is no doubt it is making the whole v2 story production-ready. The current roadmap discussion also proves the point: https://lists.apache.org/x/thread.html/r84e80216c259c81f824c6971504c321cd8c785774c489d52d4fc123f@%3Cdev.iceberg.apache.org%3E . To keep that as the highest priority at this point in time, I prefer option 1 to reduce the cost of engine maintenance and free up resources to make v2 production-ready.

On Wed, Sep 15, 2021 at 3:00 PM Saisai Shao <sai.sai.s...@gmail.com> wrote:

> From the dev's point of view, it is less of a burden to always support only the latest version of Spark (for example). But from the user's point of view, especially for those of us who maintain Spark internally, it is not easy to upgrade the Spark version in the first place (since we have many customizations internally), and we're still promoting the upgrade to 3.1.2. If the community drops support for older versions of Spark 3, users unavoidably have to maintain that support themselves.
>
> So I'm inclined to keep this support in the community rather than leave it to users; as for Option 2 or 3, I'm fine with either. And to relieve the burden, we could support a limited number of Spark versions (for example, two).
>
> Just my two cents.
>
> -Saisai
>
> Jack Ye <yezhao...@gmail.com> wrote on Wed, Sep 15, 2021 at 1:35 PM:
>
>> Hi Wing Yew,
>>
>> I think 2.4 is a different story. We will continue to support Spark 2.4, but as you can see it will continue to have very limited functionality compared to Spark 3. I believe we discussed option 3 when we were doing the Spark 3.0 to 3.1 upgrade. Recently we have been seeing the same issue for Flink 1.11, 1.12 and 1.13 as well. I feel we need a consistent strategy around this, so let's take this chance to set a good community guideline for all future engine versions, especially for Spark, Flink and Hive, which live in the same repository.
>>
>> I can totally understand your point of view, Wing. In fact, speaking from the perspective of AWS EMR, we have to support over 40 versions of the software because there are people who are still using Spark 1.4, believe it or not. After all, continuing to backport changes will become a liability not only on the user side but also on the service provider side, so I believe it's not a bad practice to push for user upgrades, as it will make life easier for both parties in the end. New features are definitely one of the best incentives to promote an upgrade on the user side.
>>
>> I think the biggest issue with option 3 is its scalability, because we will have an unbounded list of packages to add and compile in the future, and we probably cannot drop support for a package once it is created. If we go with option 1, I think we can still publish a few patch versions for old Iceberg releases, and committers can control the number of patch versions to keep people from abusing the power of patching. I see this as a consistent strategy for Flink and Hive as well. With this strategy, we can truly have a compatibility matrix of engine versions against Iceberg versions.
>>
>> -Jack
>>
>> On Tue, Sep 14, 2021 at 10:00 PM Wing Yew Poon <wyp...@cloudera.com.invalid> wrote:
>>
>>> I understand and sympathize with the desire to use new DSv2 features in Spark 3.2. I agree that Option 1 is the easiest for developers, but I don't think it considers the interests of users.
>>> I do not think that most users will upgrade to Spark 3.2 as soon as it is released. It is a "minor version" upgrade in name from 3.1 (or from 3.0), but I think we all know that it is not a minor upgrade. There are a lot of changes from 3.0 to 3.1 and from 3.1 to 3.2. I think there are even a lot of users still running Spark 2.4 who are not on Spark 3 yet. Do we also plan to stop supporting Spark 2.4?
>>>
>>> Please correct me if I'm mistaken, but the folks who have spoken out in favor of Option 1 all work for the same organization, don't they? And they don't have a problem with making their users, all internal, simply upgrade to Spark 3.2, do they? (Or they are already running an internal fork that is close to 3.2.)
>>>
>>> I work for an organization with customers running different versions of Spark. It is true that we can backport new features to older versions if we want to. I suppose the people contributing to Iceberg work for some organization or other that either uses Iceberg in-house or provides software (possibly in the form of a service) to customers, and either way, those organizations have the ability to backport features and fixes to internal versions. Are there any users out there who simply use Apache Iceberg and depend on the community version?
>>>
>>> There may be features that are broadly useful that do not depend on Spark 3.2. Is it worth supporting them on Spark 3.0/3.1 (and even 2.4)?
>>>
>>> I am not in favor of Option 2. I do not oppose Option 1, but I would consider Option 3 too. Anton, you said 5 modules are required; what are the modules you're thinking of?
>>>
>>> - Wing Yew
>>>
>>> On Tue, Sep 14, 2021 at 5:38 PM Yufei Gu <flyrain...@gmail.com> wrote:
>>>
>>>> Option 1 sounds good to me. Here are my reasons:
>>>>
>>>> 1. Both 2 and 3 will slow down development. Considering the limited resources in the open source community, the upsides of options 2 and 3 are probably not worth it.
>>>> 2. Both 2 and 3 assume use cases that may not exist. It's hard to predict anything, but even if these use cases are legitimate, users can still get a new feature by backporting it to an older version when upgrading to a newer version isn't an option.
>>>>
>>>> Best,
>>>>
>>>> Yufei
>>>>
>>>> `This is not a contribution`
>>>>
>>>> On Tue, Sep 14, 2021 at 4:54 PM Anton Okolnychyi <aokolnyc...@apple.com.invalid> wrote:
>>>>
>>>>> To sum up what we have so far:
>>>>>
>>>>> *Option 1 (support just the most recent minor Spark 3 version)*
>>>>>
>>>>> The easiest option for us devs; forces users to upgrade to the most recent minor Spark version to consume any new Iceberg features.
>>>>>
>>>>> *Option 2 (a separate project under Iceberg)*
>>>>>
>>>>> Can support as many Spark versions as needed, and the codebase is still separate as we can use separate branches.
>>>>> Impossible to consume any unreleased changes in core, which may slow down development.
>>>>>
>>>>> *Option 3 (separate modules for Spark 3.1/3.2)*
>>>>>
>>>>> Introduce more modules in the same project.
>>>>> Can consume unreleased changes, but it will require at least 5 modules to support 2.4, 3.1 and 3.2, making the build and testing complicated.
>>>>>
>>>>> Are there any users for whom upgrading the minor Spark version (e.g. 3.1 to 3.2) to consume new features is a blocker?
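The module count above comes from the DSv2 coupling: some interfaces exist only in particular Spark versions, so code that uses them can only live in a module compiled against that version. Below is a minimal sketch of the kind of code involved, assuming Spark 3.2's SupportsRuntimeFiltering API is representative of the new DSv2 features under discussion; the class name and the "part_col" column are hypothetical, and this is not Iceberg's actual scan implementation.

    // A hypothetical DSv2 scan that opts into Spark 3.2 runtime filtering.
    // SupportsRuntimeFiltering does not exist in Spark 3.0/3.1, so this class can
    // only live in a module that compiles against Spark 3.2 -- one concrete reason
    // each supported Spark line tends to need its own module.
    import org.apache.spark.sql.connector.expressions.Expressions;
    import org.apache.spark.sql.connector.expressions.NamedReference;
    import org.apache.spark.sql.connector.read.SupportsRuntimeFiltering;
    import org.apache.spark.sql.sources.Filter;
    import org.apache.spark.sql.types.StructType;

    class ExampleRuntimeFilteringScan implements SupportsRuntimeFiltering {
      private final StructType schema;

      ExampleRuntimeFilteringScan(StructType schema) {
        this.schema = schema;
      }

      @Override
      public StructType readSchema() {
        return schema;
      }

      @Override
      public NamedReference[] filterAttributes() {
        // advertise the columns Spark may push runtime filters on, e.g. a partition column
        return new NamedReference[] { Expressions.column("part_col") };
      }

      @Override
      public void filter(Filter[] filters) {
        // a real scan would prune its planned tasks here using the filters Spark
        // supplies at execution time; the sketch leaves the body empty
      }
    }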
>>>>> We follow Option 1 internally at the moment, but I would like to hear what other people think/need.
>>>>>
>>>>> - Anton
>>>>>
>>>>> On 14 Sep 2021, at 09:44, Russell Spitzer <russell.spit...@gmail.com> wrote:
>>>>>
>>>>> I think we should go for option 1. I am already not a big fan of having runtime errors for unsupported things based on versions, and I don't think minor version upgrades are a large issue for users. I'm especially not looking forward to supporting interfaces that only exist in Spark 3.2 in a future where we support multiple Spark versions.
>>>>>
>>>>> On Sep 14, 2021, at 11:32 AM, Anton Okolnychyi <aokolnyc...@apple.com.INVALID> wrote:
>>>>>
>>>>> First of all, is option 2 a viable option? We discussed moving the Python module out of the project a few weeks ago, and decided not to do that because it's beneficial for code cross-referencing and more intuitive for new developers to see everything in the same repository. I would expect the same argument to also hold here.
>>>>>
>>>>> That's exactly the concern I have about Option 2 at this moment.
>>>>>
>>>>> Overall I would personally prefer us to not support all the minor versions, but instead support maybe just the 2-3 latest versions in a major version.
>>>>>
>>>>> This is where it gets a bit complicated. If we want to support both Spark 3.1 and Spark 3.2 with a single module, it means we have to compile against 3.1. The problem is that we rely on DSv2, which is being actively developed; 3.2 and 3.1 have substantial differences. On top of that, we have our extensions, which are extremely low-level and may break not only between minor versions but also between patch releases.
>>>>>
>>>>> If there are some features requiring a newer version, it makes sense to move to that newer version in master.
>>>>>
>>>>> Internally, we don't deliver new features to older Spark versions, as it requires a lot of effort to port things. Personally, I don't think it is too bad to require users to upgrade if they want new features. At the same time, there are valid concerns with this approach too, which we mentioned during the sync. For example, certain new features would also work fine with older Spark versions. I generally agree with that, and not supporting recent versions is not ideal. However, I want to find a balance between the complexity on our side and ease of use for the users. Ideally, supporting a few recent versions would be sufficient, but our Spark integration is too low-level to do that with a single module.
>>>>>
>>>>> On 13 Sep 2021, at 20:53, Jack Ye <yezhao...@gmail.com> wrote:
>>>>>
>>>>> First of all, is option 2 a viable option? We discussed moving the Python module out of the project a few weeks ago, and decided not to do that because it's beneficial for code cross-referencing and more intuitive for new developers to see everything in the same repository. I would expect the same argument to also hold here.
>>>>>
>>>>> Overall I would personally prefer us to not support all the minor versions, but instead support maybe just the 2-3 latest versions in a major version. This avoids the problem that some users are unwilling to move to a newer version and keep patching old Spark version branches.
>>>>> If there are some features requiring a newer version, it makes sense to move to that newer version in master.
>>>>>
>>>>> In addition, because Spark is currently considered the most feature-complete reference implementation compared to all other engines, I think we should not add artificial barriers that would slow down its development speed.
>>>>>
>>>>> So my thinking is closer to option 1.
>>>>>
>>>>> Best,
>>>>> Jack Ye
>>>>>
>>>>> On Mon, Sep 13, 2021 at 7:39 PM Anton Okolnychyi <aokolnyc...@apple.com.invalid> wrote:
>>>>>
>>>>>> Hey folks,
>>>>>>
>>>>>> I want to discuss our Spark version support strategy.
>>>>>>
>>>>>> So far, we have tried to support both 3.0 and 3.1. It is great to support older versions, but because we compile against 3.0, we cannot use any Spark features that are offered in newer versions. Spark 3.2 is just around the corner and it brings a lot of important features such as dynamic filtering for v2 tables, required distribution and ordering for writes, etc. These features are too important to ignore.
>>>>>>
>>>>>> Apart from that, I have an end-to-end prototype for merge-on-read with Spark that actually leverages some of the 3.2 features. I'll be implementing all of the new Spark DSv2 APIs for us internally and would love to share that with the rest of the community.
>>>>>>
>>>>>> I see two options to move forward:
>>>>>>
>>>>>> Option 1
>>>>>>
>>>>>> Migrate to Spark 3.2 in master; maintain 0.12 for a while by releasing minor versions with bug fixes.
>>>>>>
>>>>>> Pros: almost no changes to the build configuration, no extra work on our side as just a single Spark version is actively maintained.
>>>>>> Cons: some new features that we will be adding to master could also work with older Spark versions, but all 0.12 releases will only contain bug fixes. Therefore, users will be forced to migrate to Spark 3.2 to consume any new Spark or format features.
>>>>>>
>>>>>> Option 2
>>>>>>
>>>>>> Move our Spark integration into a separate project and introduce branches for 3.0, 3.1 and 3.2.
>>>>>>
>>>>>> Pros: decouples the format version from Spark; we can support as many Spark versions as needed.
>>>>>> Cons: more work initially to set everything up, more work to release, and we will need a new release of the core format to consume any changes in the Spark integration.
>>>>>>
>>>>>> Overall, I think option 2 seems better for the user, but my main worry is that we will have to release the format more frequently (which is a good thing but requires more work and time) and that overall Spark development may be slower.
>>>>>>
>>>>>> I'd love to hear what everybody thinks about this matter.
>>>>>>
>>>>>> Thanks,
>>>>>> Anton
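To make the "required distribution and ordering for writes" feature above concrete, here is a minimal sketch, assuming the Spark 3.2 RequiresDistributionAndOrdering interface is the API being referred to; the class name and the "part_col" column are hypothetical, and this is not the actual Iceberg integration.

    // A hypothetical DSv2 Write that tells Spark 3.2 how to shuffle rows before
    // handing them to the writer. RequiresDistributionAndOrdering and the
    // Distribution/SortOrder types only exist in Spark 3.2, so compiling against
    // 3.0/3.1 rules this feature out.
    import org.apache.spark.sql.connector.distributions.Distribution;
    import org.apache.spark.sql.connector.distributions.Distributions;
    import org.apache.spark.sql.connector.expressions.Expression;
    import org.apache.spark.sql.connector.expressions.Expressions;
    import org.apache.spark.sql.connector.expressions.SortOrder;
    import org.apache.spark.sql.connector.write.RequiresDistributionAndOrdering;

    class ExampleClusteredWrite implements RequiresDistributionAndOrdering {

      @Override
      public Distribution requiredDistribution() {
        // ask Spark to cluster incoming rows by a (made-up) partition column
        return Distributions.clustered(new Expression[] { Expressions.column("part_col") });
      }

      @Override
      public SortOrder[] requiredOrdering() {
        // no within-task ordering required in this sketch
        return new SortOrder[0];
      }

      // a real Write would also override toBatch() to return its BatchWrite
    }

Because Spark performs the requested shuffle before invoking the writer, the connector no longer needs its own distribution logic; that is the kind of 3.2-only improvement the thread is weighing against broader version support.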