Re: Problems with Functions/IO in Upgrading Pulsar from 2.7 to 2.8

Devin Bost Mon, 19 Jul 2021 11:42:13 -0700

> I think Sijie is referring to using KubernetesRuntime to deploy functions
> where each function/source/sink runs as an independent statefulset in K8s.
> In this scenario, it is possible to have fine grained control over which
> version of the function container the function is using.


Not everybody is using the KubernetesRuntime yet (especially since the Helm
charts aren't feature-complete), and it appears that those who aren't
running KubernetesRuntime would be impacted the most by this issue.

Devin G. Bost


On Mon, Jul 19, 2021 at 12:36 PM Devin Bost <[email protected]> wrote:

> > For example, if you are upgrading Flink from one version to another,
> > you have to take a savepoint in the previous version for all the
> > Flink jobs, then upgrade the Flink cluster and resume the jobs in
> > the new version.
> >
> > https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/ops/upgrading/
> >
> > So it is not unreasonable to ask people to do that when dealing
> > with upgrading a centralized computing engine.
>
> One difference with Flink is that organizations running Flink in job mode
> or application mode can upgrade jobs independently of one another, so teams
> can upgrade jobs when they are ready without impacting other teams. In the
> Pulsar case, Pulsar is multi-tenant, so upgrading the entire cluster would
> break every tenant simultaneously and would block the flow of all messages
> until all functions are upgraded. If one team takes a year to upgrade their
> one function, the cluster could not be upgraded until that happened. Also,
> after all the functions have been upgraded, there would be production
> downtime while deploying all the upgraded functions, which would be a major
> outage... It might be possible to write a script to speed up the deployment
> to shrink the outage window, but there's currently a bug that wipes out
> existing userConfigs when a function is upgraded, so that adds to the
> complexity of upgrading all the functions since someone would need to know
> all the userConfigs for all the functions.
>
> So, I don't think we're really comparing the same things here.
>
> Devin G. Bost
>
>
> On Mon, Jul 19, 2021 at 12:17 PM Sijie Guo <[email protected]> wrote:
>
>> On Mon, Jul 19, 2021 at 10:32 AM Jerry Peng <[email protected]>
>> wrote:
>> >
>> > I agree that the best we can do right now is to just clearly document
>> > this as a potential problem when updating 2.7 to 2.8.
>> >
>> > We should definitely make every attempt to not make BC breaking changes.
>> > However, there are times when we have to make these tough decisions for
>> > one reason or another. The bigger problem I see here is not necessarily
>> > that a BC breaking change occurred, but rather that we didn't know about
>> > it beforehand, so we could not clearly document this caveat when 2.8 was
>> > released. Perhaps this is where we can improve our backwards
>> > compatibility testing. We already have some, but probably not enough, as
>> > highlighted by this case.
>> >
>> > In regards to
>> >
>> > > This is partially correct, because you can wait to upgrade the
>> > > workers pod, but there is no fine grained control over which version
>> > > of each pod will be running your function, especially in a big cluster
>> > > with many tenants and functions with this problem
>> > >
>> >
>> >
>> > I think Sijie is referring to using KubernetesRuntime to deploy
>> > functions where each function/source/sink runs as an independent
>> > statefulset in K8s. In this scenario, it is possible to have fine grained
>> > control over which version of the function container the function is
>> > using. There currently might not be tools to easily allow users to do
>> > this, but using kubectl one can definitely determine which container
>> > version is running and potentially update the container version on a per
>> > function basis.
>>
>> Jerry - Thank you! That was what I meant.
>>
>> >
>> > Best,
>> >
>> > Jerry
>> >
>> > On Mon, Jul 19, 2021 at 12:50 AM Enrico Olivelli <[email protected]>
>> > wrote:
>> >
>> > > Sijie,
>> > > Thank you for your feedback
>> > > Some additional considerations inline
>> > >
>> > > On Mon, 19 Jul 2021, 06:47 Sijie Guo <[email protected]> wrote:
>> > >
>> > > > I don't think this is a big problem, because people can recompile
>> > > > the function and resubmit it. Most of the computing/streaming
>> > > > engines ask users to recompile and resubmit their jobs when the
>> > > > engine is upgraded to a new version.
>> > >
>> > >
>> > > Unfortunately this is not easily feasible if the org that is managing
>> > > the Pulsar service is different from the org that is developing the
>> > > Functions. In particular, it is nearly impossible to prevent service
>> > > interruption.
>> > >
>> > > BTW I believe that there is no way to fix this at this point.
>> > >
>> > > > The best approach here is to document this behavior.
>> > > >
>> > >
>> > > I agree that the best thing we can do is to document this requirement.
>> > >
>> > > Therefore we must ensure that we do not run into this kind of issue
>> > > again in the future.
>> > >
>> > > Pulsar is increasingly used by large enterprises, and backward
>> > > compatibility is very valuable.
>> > >
>> > > Fortunately not all the Functions need rebuilding.
>> > >
>> > >
>> > >
>> > >
>> > > > Also, if you are using Kubernetes runtime to schedule functions, you
>> > > > are not really impacted.
>> > > >
>> > >
>> > > This is partially correct, because you can wait to upgrade the
>> > > workers pod, but there is no fine grained control over which version
>> > > of each pod will be running your function, especially in a big cluster
>> > > with many tenants and functions with this problem
>> > >
>> > >
>> > > Enrico
>> > >
>> > >
>> > > > - Sijie
>> > > >
>> > > > On Fri, Jul 16, 2021 at 2:44 AM Enrico Olivelli <[email protected]>
>> > > > wrote:
>> > > > >
>> > > > > Hello,
>> > > > > I have reported this issue [1] about upgrading from Pulsar 2.7 to
>> > > > > 2.8. More information is on the ticket, but the short version of
>> > > > > the story is that in Pulsar 2.8 we introduced a breaking change in
>> > > > > the Schema API, by switching SchemaInfo from a class to an
>> > > > > interface.
>> > > > >
>> > > > > This leads to an IncompatibleClassChangeError when you have a
>> > > > > Function or a Connector that is using Schema.JSON(Pojo.class) and
>> > > > > you upgrade your Pulsar cluster (the functions worker pod for
>> > > > > instance) from Pulsar 2.7.x to Pulsar 2.8.0.
>> > > > >
>> > > > > The main problem is that you cannot upgrade Pulsar without
>> > > > > interrupting the service and coordinating with the upgrade of the
>> > > > > Functions. Your functions need to be recompiled against the Pulsar
>> > > > > 2.8 API and deployed again in production.
>> > > > >
>> > > > > I have tried to move SchemaInfo back to an "abstract class" but
>> > > > > without success, because then you run into errors.
>> > > > >
>> > > > > I am not sure there is a way to provide a good "upgrade path" for
>> > > > > Functions/IO users.
>> > > > >
>> > > > > If we do not find a way, we have to document the upgrade in the
>> > > > > official Pulsar Documentation.
>> > > > >
>> > > > > We must do our best to prevent users from falling into this bad
>> > > > > situation again.
>> > > > >
>> > > > > Any suggestions or thoughts ?
>> > > > >
>> > > > > Regards
>> > > > > Enrico
>> > > > >
>> > > > > [1] https://github.com/apache/pulsar/issues/11338
>> > > >
>> > >
>>
>
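
For readers who want to see which side of the change described in the
original report a given classpath is on, here is a minimal sketch. It
relies on the thread's statement that SchemaInfo was a class in 2.7 and
became an interface in 2.8. Code compiled against a class calls its
methods with a different JVM instruction than code compiled against an
interface, which is why an unrecompiled function fails with
IncompatibleClassChangeError. The helper name SchemaInfoKindCheck is
hypothetical, and the fully qualified name
org.apache.pulsar.common.schema.SchemaInfo is an assumption not stated in
the thread. This is a diagnostic aid only, not a fix or an official
Pulsar tool.

```java
import java.lang.reflect.Modifier;

// Hypothetical diagnostic helper; not part of Pulsar.
final class SchemaInfoKindCheck {

    // Assumed fully qualified name of the type discussed in the thread.
    static final String SCHEMA_INFO = "org.apache.pulsar.common.schema.SchemaInfo";

    /** Reports how an already loaded type is declared. */
    public static String kindOf(Class<?> type) {
        if (type.isInterface()) {
            return "interface";
        }
        if (Modifier.isAbstract(type.getModifiers())) {
            return "abstract class";
        }
        return "class";
    }

    /** Looks a type up by name without initializing it; "absent" if it cannot be loaded. */
    public static String kindOf(String className) {
        try {
            ClassLoader loader = SchemaInfoKindCheck.class.getClassLoader();
            return kindOf(Class.forName(className, false, loader));
        } catch (ClassNotFoundException | LinkageError e) {
            return "absent";
        }
    }

    public static void main(String[] args) {
        // Per the thread: "class" would indicate a 2.7.x client API on the
        // classpath, "interface" a 2.8.x one.
        System.out.println(SCHEMA_INFO + " -> " + kindOf(SCHEMA_INFO));
    }
}
```

Running this with the function's own dependencies on the classpath shows
which API generation the function was packaged with; it does not replace
recompiling against the 2.8 API as described in the thread.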
