Re: Re: Problems with Functions/IO in Upgrading Pulsar from 2.7 to 2.8

Enrico Olivelli Tue, 20 Jul 2021 05:48:29 -0700

I have filed a PR with an update to the Release notes for 2.8.0
https://github.com/apache/pulsar/pull/11392


Thank you all for your feedback
Enrico

On Tue, 20 Jul 2021 at 00:54, Neng Lu <nl...@apache.org> wrote:

> Based on my local test, it's fine for String Schema.
>
> On 2021/07/19 18:47:49 Devin Bost wrote:
> > > This leads to an IncompatibleClassChangeError when you have a Function or
> > > a Connector that is using Schema.JSON(Pojo.class)
> >
> > I just noticed this detail. Do we have a sense of how often people are
> > using Schema.JSON in Functions/Connectors?
> > Most of our functions are using a string schema, so it's not clear to me if
> > they would be impacted.
> >
> > Devin G. Bost
> >
> >
> > On Mon, Jul 19, 2021 at 12:41 PM Devin Bost <devin.b...@gmail.com> wrote:
> >
> > > > I think Sijie is referring to using KubernetesRuntime to deploy functions
> > > > where each function/source/sink runs as an independent statefulset in K8s.
> > > > In this scenario, it is possible to have fine grained control over which
> > > > version of the function container the function is using.
> > >
> > > Not everybody is using the KubernetesRuntime yet (especially since the
> > > Helm charts aren't feature-complete), and it appears that those who aren't
> > > running KubernetesRuntime would be impacted the most by this issue.
> > >
> > > Devin G. Bost
> > >
> > >
> > > On Mon, Jul 19, 2021 at 12:36 PM Devin Bost <devin.b...@gmail.com> wrote:
> > >
> > >> > For example, if you are upgrading Flink from one version to the other
> > >> > version, you have to make a save point in the previous version for all
> > >> > the Flink jobs.
> > >> > Upgrade the Flink cluster and resume jobs in a new version.
> > >> >
> > >> > https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/ops/upgrading/
> > >> >
> > >> > So it is not unreasonable for asking people to do that when dealing
> > >> > with upgrading a centralized computing engine.
> > >>
> > >> One difference with Flink is that organizations running Flink in job mode
> > >> or application mode can upgrade jobs independently of one another, so teams
> > >> can upgrade jobs when they are ready without impacting other teams. In the
> > >> Pulsar case, Pulsar is multi-tenant, so upgrading the entire cluster would
> > >> break every tenant simultaneously and would block the flow of all messages
> > >> until all functions are upgraded. If one team takes a year to upgrade their
> > >> one function, the cluster could not be upgraded until that happened. Also,
> > >> after all the functions have been upgraded, there would be production
> > >> downtime while deploying all the upgraded functions, which would be a major
> > >> outage... It might be possible to write a script to speed up the deployment
> > >> to shrink the outage window, but there's currently a bug that wipes out
> > >> existing userConfigs when a function is upgraded, so that adds to the
> > >> complexity of upgrading all the functions since someone would need to know
> > >> all the userConfigs for all the functions.
> > >>
> > >> So, I don't think we're really comparing the same things here.
> > >>
> > >> Devin G. Bost
> > >>
> > >>
> > >> On Mon, Jul 19, 2021 at 12:17 PM Sijie Guo <guosi...@gmail.com> wrote:
> > >>
> > >>> On Mon, Jul 19, 2021 at 10:32 AM Jerry Peng <jerry.boyang.p...@gmail.com>
> > >>> wrote:
> > >>> >
> > >>> > I agree that the best we can do right now is to just clearly document this
> > >>> > as a potential problem when updating 2.7 to 2.8.
> > >>> >
> > >>> > We should definitely make every attempt to not make BC breaking changes.
> > >>> > However, there are times when we have to make these tough decisions for one
> > >>> > reason or another. The bigger problem I see here is not necessarily that a
> > >>> > BC breaking change occurred, but rather that we didn't know about it
> > >>> > beforehand so that we could clearly document this caveat when 2.8 was
> > >>> > released.  Perhaps this is where we can improve our backwards compatibility
> > >>> > testing.  We already have some but probably not enough, as highlighted by
> > >>> > this case.
> > >>> >
> > >>> > In regards to
> > >>> >
> > >>> > > This is partially correct, because you can wait to upgrade the workers pod,
> > >>> > > but there is no fine grained control over which version of each pod will
> > >>> > > be running your function, especially in a big cluster with many tenants and
> > >>> > > functions with this problem
> > >>> > >
> > >>> >
> > >>> >
> > >>> > I think Sijie is referring to using KubernetesRuntime to deploy functions
> > >>> > where each function/source/sink runs as an independent statefulset in K8s.
> > >>> > In this scenario, it is possible to have fine grained control over which
> > >>> > version of the function container the function is using.  There currently
> > >>> > might not be tools to easily allow users to do this but using kubectl one
> > >>> > can definitely determine which container version is running and potentially
> > >>> > update the container version on a per function basis.
> > >>>
> > >>> Jerry - Thank you! That was what I meant.
> > >>>
> > >>> >
> > >>> > Best,
> > >>> >
> > >>> > Jerry
> > >>> >
> > >>> > On Mon, Jul 19, 2021 at 12:50 AM Enrico Olivelli <eolive...@gmail.com>
> > >>> > wrote:
> > >>> >
> > >>> > > Sijie,
> > >>> > > Thank you for your feedback
> > >>> > > Some additional considerations inline
> > >>> > >
> > >>> > > On Mon, 19 Jul 2021, 06:47 Sijie Guo <guosi...@gmail.com> wrote:
> > >>> > >
> > >>> > > > I don't think this is a big problem. Because people can recompile the
> > >>> > > > function and submit the function. Most of the computing/streaming
> > >>> > > > engines ask users to recompile the jobs and resubmit the jobs when it
> > >>> > > > upgrades to a new version.
> > >>> > >
> > >>> > >
> > >>> > > Unfortunately this is not easily feasible if the org that is managing the
> > >>> > > Pulsar service is different from the org who is developing the Functions.
> > >>> > > And especially it is quite impossible to prevent service interruption.
> > >>> > >
> > >>> > > BTW I believe that there is no way to fix this at this point.
> > >>> > >
> > >>> > > > The best approach here is to document this behavior.
> > >>> > > >
> > >>> > >
> > >>> > > I agree that the best thing we can do is to document this requirement.
> > >>> > >
> > >>> > > Therefore we must ensure in the future that we won't fall again into this
> > >>> > > kind of issue.
> > >>> > >
> > >>> > > Pulsar is becoming more and more used by large enterprises and backward
> > >>> > > compatibility is a big value.
> > >>> > >
> > >>> > > Fortunately not all the Functions need rebuilding.
> > >>> > >
> > >>> > >
> > >>> > >
> > >>> > >
> > >>> > > > Also, if you are using Kubernetes runtime to schedule functions, you
> > >>> > > > are not really impacted.
> > >>> > > >
> > >>> > >
> > >>> > > This is partially correct, because you can wait to upgrade the workers pod,
> > >>> > > but there is no fine grained control over which version of each pod will
> > >>> > > be running your function, especially in a big cluster with many tenants and
> > >>> > > functions with this problem
> > >>> > >
> > >>> > >
> > >>> > > Enrico
> > >>> > >
> > >>> > >
> > >>> > > > - Sijie
> > >>> > > >
> > >>> > > > On Fri, Jul 16, 2021 at 2:44 AM Enrico Olivelli <eolive...@gmail.com>
> > >>> > > > wrote:
> > >>> > > > >
> > >>> > > > > Hello,
> > >>> > > > > I have reported this issue [1] about upgrading from Pulsar 2.7 to 2.8.
> > >>> > > > > More information is on the ticket, but the short version of the story is
> > >>> > > > > that in Pulsar 2.8 we introduced a breaking change in the Schema API, by
> > >>> > > > > switching SchemaInfo from a class to an interface.
> > >>> > > > >
> > >>> > > > > This leads to an IncompatibleClassChangeError when you have a Function or
> > >>> > > > > a Connector that is using Schema.JSON(Pojo.class) and you upgrade your
> > >>> > > > > Pulsar cluster (the functions worker pod for instance) from Pulsar 2.7.x to
> > >>> > > > > Pulsar 2.8.0.
> > >>> > > > >
> > >>> > > > > The bad problem is that you cannot upgrade Pulsar without interrupting the
> > >>> > > > > service and coordinating with the upgrade of the Functions.
> > >>> > > > > Your functions need to be recompiled against the Pulsar 2.8 API and
> > >>> > > > > deployed again in production.
> > >>> > > > >
> > >>> > > > > I have tried to move back SchemaInfo to an "abstract class" but without
> > >>> > > > > success, because then you fall into errors.
> > >>> > > > >
> > >>> > > > > I am not sure there is a way to provide a good "upgrade path" for
> > >>> > > > > Functions/IO users.
> > >>> > > > >
> > >>> > > > > If we do not find a way we have to document the upgrade in the official
> > >>> > > > > Pulsar Documentation.
> > >>> > > > >
> > >>> > > > > We must do our best to prevent users from falling again into this bad
> > >>> > > > > situation.
> > >>> > > > >
> > >>> > > > > Any suggestions or thoughts ?
> > >>> > > > >
> > >>> > > > > Regards
> > >>> > > > > Enrico
> > >>> > > > >
> > >>> > > > > [1] https://github.com/apache/pulsar/issues/11338
> > >>> > > >
> > >>> > >
> > >>>
> > >>
> >
>
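
For anyone trying to work out which side of the change described in the
original report a given runtime is on (SchemaInfo switching from a class to
an interface in 2.8), here is a minimal sketch that inspects the classpath
with plain JDK reflection. The fully qualified name
org.apache.pulsar.common.schema.SchemaInfo is an assumption about where the
type lives in the client API, and SchemaInfoKindCheck and kindOf are
hypothetical names used only for illustration. It does not fix the
incompatibility; it only reports what the JVM sees.

```java
final class SchemaInfoKindCheck {

    // Assumed location of the type that changed between 2.7 and 2.8.
    static final String SCHEMA_INFO = "org.apache.pulsar.common.schema.SchemaInfo";

    /**
     * Returns "interface" or "class" depending on how the named type is
     * declared on the current classpath, or "missing" if it cannot be loaded.
     */
    static String kindOf(String className) {
        try {
            // initialize=false: only load the type, do not run static initializers.
            Class<?> type = Class.forName(
                    className, false, SchemaInfoKindCheck.class.getClassLoader());
            return type.isInterface() ? "interface" : "class";
        } catch (ClassNotFoundException | LinkageError e) {
            return "missing";
        }
    }

    public static void main(String[] args) {
        // Per the thread: a class suggests a 2.7.x API, an interface suggests 2.8.x.
        System.out.println(SCHEMA_INFO + " is: " + kindOf(SCHEMA_INFO));
    }
}
```

Running this inside the same environment as the function (for example the
functions worker or the function container) would show whether a function
compiled against the 2.7 API is about to meet the interface form of the type.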
