Re: Problems with Functions/IO in Upgrading Pulsar from 2.7 to 2.8

Enrico Olivelli Mon, 19 Jul 2021 14:15:20 -0700

On Mon, 19 Jul 2021, 20:48 Devin Bost <[email protected]> wrote:


> > This leads to an IncompatibleClassChangeError when you have a Function or
> > a Connector that is using Schema.JSON(Pojo.class)
>
> I just noticed this detail. Do we have a sense of how often people are
> using Schema.JSON in Functions/Connectors?
>

The case I found involves a Function that creates its own Pulsar client and
a Producer, and therefore has to call a static method of Schema (for
instance Schema.JSON).

Functions do not normally behave that way.
Connectors, however, are more likely to use those methods, especially now
that in 2.9.0 we are going to give them a full PulsarClient.
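
To make the failure mode concrete, here is a minimal sketch of why code
compiled against a class breaks once that type becomes an interface. It does
not use Pulsar at all: SchemaInfo, SchemaInfoImpl, Factory and Client below are
hypothetical stand-ins, compiled at runtime with the JDK's own compiler, and
the whole thing assumes it is run on a JDK (not a bare JRE).

```java
import javax.tools.JavaCompiler;
import javax.tools.ToolProvider;
import java.lang.reflect.InvocationTargetException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class ClassToInterfaceDemo {

    // Writes (fileName, source) pairs into src and compiles them into out.
    static void compile(Path src, Path out, String... namesAndSources) throws Exception {
        JavaCompiler javac = ToolProvider.getSystemJavaCompiler(); // null on a bare JRE
        List<String> args = new ArrayList<>();
        args.add("-d");
        args.add(out.toString());
        for (int i = 0; i < namesAndSources.length; i += 2) {
            Path file = src.resolve(namesAndSources[i]);
            Files.write(file, namesAndSources[i + 1].getBytes(StandardCharsets.UTF_8));
            args.add(file.toString());
        }
        if (javac.run(null, null, null, args.toArray(new String[0])) != 0) {
            throw new IllegalStateException("compilation failed");
        }
    }

    // Returns the error raised by the stale client, or null if it ran fine.
    static Throwable runStaleClient() throws Exception {
        Path src = Files.createTempDirectory("demo-src");
        Path out = Files.createTempDirectory("demo-out");

        // "Old" library: SchemaInfo is a class. Client is compiled against it.
        compile(src, out,
            "SchemaInfo.java",
            "public class SchemaInfo { public String getName() { return \"json\"; } }",
            "Factory.java",
            "public class Factory { public static SchemaInfo create() { return new SchemaInfo(); } }",
            "Client.java",
            "public class Client { public static String run() { return Factory.create().getName(); } }");

        // "New" library: SchemaInfo is now an interface. Client is NOT recompiled.
        compile(src, out,
            "SchemaInfo.java",
            "public interface SchemaInfo { String getName(); }",
            "SchemaInfoImpl.java",
            "public class SchemaInfoImpl implements SchemaInfo { public String getName() { return \"json\"; } }",
            "Factory.java",
            "public class Factory { public static SchemaInfo create() { return new SchemaInfoImpl(); } }");

        // Load the stale Client.class next to the new library classes and call it.
        try (URLClassLoader loader = new URLClassLoader(new URL[] {out.toUri().toURL()}, null)) {
            loader.loadClass("Client").getMethod("run").invoke(null);
            return null;
        } catch (InvocationTargetException e) {
            return e.getCause();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("Stale client result: " + runStaleClient());
    }
}
```

The stale Client still contains a class-style method call on SchemaInfo, so the
JVM should reject it at link time with an IncompatibleClassChangeError.
Recompiling Client against the new sources makes the error go away, which is
the same remedy discussed in this thread for affected Functions.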

Enrico


> Most of our functions are using a string schema, so it's not clear to me if
> they would be impacted.
>
> Devin G. Bost
>
>
> On Mon, Jul 19, 2021 at 12:41 PM Devin Bost <[email protected]> wrote:
>
> > > I think Sijie is referring to using KubernetesRuntime to deploy functions
> > > where each function/source/sink runs as an independent statefulset in K8s.
> > > In this scenario, it is possible to have fine grained control over which
> > > version of the function container the function is using.
> >
> > Not everybody is using the KubernetesRuntime yet (especially since the
> > Helm charts aren't feature-complete), and it appears that those who aren't
> > running KubernetesRuntime would be impacted the most by this issue.
> >
> > Devin G. Bost
> >
> >
> > On Mon, Jul 19, 2021 at 12:36 PM Devin Bost <[email protected]> wrote:
> >
> >> > For example, if you are upgrading Flink from one version to the other
> >> > version, you have to make a save point in the previous version for all
> >> > the Flink jobs.
> >> > Upgrade the Flink cluster and resume jobs in a new version.
> >> >
> >> >
> >> > https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/ops/upgrading/
> >> >
> >> > So it is not unreasonable for asking people to do that when dealing
> >> > with upgrading a centralized computing engine.
> >>
> >> One difference with Flink is that organizations running Flink in job mode
> >> or application mode can upgrade jobs independently of one another, so teams
> >> can upgrade jobs when they are ready without impacting other teams. In the
> >> Pulsar case, Pulsar is multi-tenant, so upgrading the entire cluster would
> >> break every tenant simultaneously and would block the flow of all messages
> >> until all functions are upgraded. If one team takes a year to upgrade their
> >> one function, the cluster could not be upgraded until that happened. Also,
> >> after all the functions have been upgraded, there would be production
> >> downtime while deploying all the upgraded functions, which would be a major
> >> outage... It might be possible to write a script to speed up the deployment
> >> to shrink the outage window, but there's currently a bug that wipes out
> >> existing userConfigs when a function is upgraded, so that adds to the
> >> complexity of upgrading all the functions since someone would need to know
> >> all the userConfigs for all the functions.
> >>
> >> So, I don't think we're really comparing the same things here.
> >>
> >> Devin G. Bost
> >>
> >>
> >> On Mon, Jul 19, 2021 at 12:17 PM Sijie Guo <[email protected]> wrote:
> >>
> >>> On Mon, Jul 19, 2021 at 10:32 AM Jerry Peng <[email protected]> wrote:
> >>> >
> >>> > I agree that the best we can do right now is to just clearly document this
> >>> > as a potential problem when updating 2.7 to 2.8.
> >>> >
> >>> > We should definitely make every attempt to not make BC breaking changes.
> >>> > However, there are times when we have to make these tough decisions for one
> >>> > reason or another. The bigger problem I see here is not necessarily a BC
> >>> > breaking change occurred, but rather we didn't know about it beforehand so
> >>> > we can clearly document this caveat when 2.8 is released. Perhaps this is
> >>> > where we can improve our backwards compatibility testing. We already have
> >>> > some but probably not enough as highlighted by this case.
> >>> >
> >>> > In regards to
> >>> >
> >>> > > This is partially correct, because you can wait to upgrade the workers pod,
> >>> > > but there is no fine grained control over which version of each pod will
> >>> > > be running your function, especially in a big cluster with many tenants and
> >>> > > functions with this problem
> >>> > >
> >>> >
> >>> >
> >>> > I think Sijie is referring to using KubernetesRuntime to deploy functions
> >>> > where each function/source/sink runs as an independent statefulset in K8s.
> >>> > In this scenario, it is possible to have fine grained control over which
> >>> > version of the function container the function is using. There currently
> >>> > might not be tools to easily allow users to do this but using kubectl one
> >>> > can definitely determine which container version is running and potentially
> >>> > update the container version on a per function basis.
> >>>
> >>> Jerry - Thank you! That was what I meant.
> >>>
> >>> >
> >>> > Best,
> >>> >
> >>> > Jerry
> >>> >
> >>> > On Mon, Jul 19, 2021 at 12:50 AM Enrico Olivelli <[email protected]> wrote:
> >>> >
> >>> > > Sijie,
> >>> > > Thank you for your feedback
> >>> > > Some additional considerations inline
> >>> > >
> >>> > > On Mon, 19 Jul 2021, 06:47 Sijie Guo <[email protected]> wrote:
> >>> > >
> >>> > > > I don't think this is a big problem. Because people can recompile the
> >>> > > > function and submit the function. Most of the computing/streaming
> >>> > > > engines ask users to recompile the jobs and resubmit the jobs when it
> >>> > > > upgrades to a new version.
> >>> > >
> >>> > >
> >>> > > Unfortunately this is not easily feasible if the org that is managing the
> >>> > > Pulsar service is different from the org who is developing the Functions.
> >>> > > And especially it is quite impossible to prevent service interruption.
> >>> > >
> >>> > > BTW I believe that there is no way to fix this at this point.
> >>> > >
> >>> > > > The best approach here is to document this
> >>> > > > behavior.
> >>> > > >
> >>> > >
> >>> > > I agree that the best thing we can do is to document this requirement.
> >>> > >
> >>> > > Therefore we must ensure in the future that we won't fall again into this
> >>> > > kind of issues.
> >>> > >
> >>> > > Pulsar is becoming more and more used by large enterprises and backward
> >>> > > compatibility is a big value.
> >>> > >
> >>> > > Fortunately not all the Functions need rebuilding.
> >>> > >
> >>> > >
> >>> > >
> >>> > >
> >>> > > > Also, if you are using Kubernetes runtime to schedule functions, you
> >>> > > > are not really impacted.
> >>> > > >
> >>> > >
> >>> > > This is partially correct, because you can wait to upgrade the workers pod,
> >>> > > but there is no fine grained control over which version of each pod will
> >>> > > be running your function, especially in a big cluster with many tenants and
> >>> > > functions with this problem
> >>> > >
> >>> > >
> >>> > > Enrico
> >>> > >
> >>> > >
> >>> > > > - Sijie
> >>> > > >
> >>> > > > On Fri, Jul 16, 2021 at 2:44 AM Enrico Olivelli <[email protected]> wrote:
> >>> > > > >
> >>> > > > > Hello,
> >>> > > > > I have reported this issue [1] about upgrading from Pulsar 2.7 to 2.8.
> >>> > > > > More information is on the ticket, but the short version of the story is
> >>> > > > > that in Pulsar 2.8 we introduced a breaking change in the Schema API, by
> >>> > > > > switching SchemaInfo from a class to an interface.
> >>> > > > >
> >>> > > > > This leads to an IncompatibleClassChangeError when you have a Function or
> >>> > > > > a Connector that is using Schema.JSON(Pojo.class) and you upgrade your
> >>> > > > > Pulsar cluster (the functions worker pod for instance) from Pulsar 2.7.x to
> >>> > > > > Pulsar 2.8.0.
> >>> > > > >
> >>> > > > > The bad problem is that you cannot upgrade Pulsar without interrupting the
> >>> > > > > service and coordinating with the upgrade of the Functions.
> >>> > > > > Your functions need to be recompiled against the Pulsar 2.8 API and
> >>> > > > > deployed again in production.
> >>> > > > >
> >>> > > > > I have tried to move back SchemaInfo to an "abstract class" but without
> >>> > > > > success, because then you fall into errors.
> >>> > > > >
> >>> > > > > I am not sure there is a way to provide a good "upgrade path" for
> >>> > > > > Functions/IO users.
> >>> > > > >
> >>> > > > > If we do not find a way we have to document the upgrade in the official
> >>> > > > > Pulsar Documentation.
> >>> > > > >
> >>> > > > > We must do our best to prevent users from falling again into this bad
> >>> > > > > situation.
> >>> > > > >
> >>> > > > > Any suggestions or thoughts ?
> >>> > > > >
> >>> > > > > Regards
> >>> > > > > Enrico
> >>> > > > >
> >>> > > > > [1] https://github.com/apache/pulsar/issues/11338
> >>> > > >
> >>> > >
> >>>
> >>
>
