Re: Problems with Functions/IO in Upgrading Pulsar from 2.7 to 2.8

Devin Bost Mon, 19 Jul 2021 11:48:40 -0700

> This leads to an IncompatibleClassChangeError when you have a Function or
> a Connector that is using Schema.JSON(Pojo.class)


I just noticed this detail. Do we have a sense of how often people are
using Schema.JSON in Functions/Connectors?
Most of our functions use a string schema, so it's not clear to me whether
they would be impacted.
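
For anyone trying to work out which side of this change a given runtime is on, the following is a minimal sketch. It only reports whether SchemaInfo is visible on the classpath as a class (2.7.x, per this thread) or as an interface (2.8.0); it does not decide whether a particular function is affected. The names SchemaInfoProbe and kindOf are hypothetical, and the sketch assumes the type lives at org.apache.pulsar.common.schema.SchemaInfo.

```java
import java.lang.reflect.Modifier;

public class SchemaInfoProbe {

    // Assumed fully qualified name of the type that changed shape in 2.8.
    static final String SCHEMA_INFO = "org.apache.pulsar.common.schema.SchemaInfo";

    /** Returns "interface", "abstract class", "class", or "absent" for a type name. */
    static String kindOf(String className) {
        try {
            // initialize=false: inspect the type without running its static initializers.
            Class<?> type = Class.forName(
                    className, false, SchemaInfoProbe.class.getClassLoader());
            if (type.isInterface()) {
                // Checked first, because interfaces also carry the abstract modifier.
                return "interface";
            }
            return Modifier.isAbstract(type.getModifiers()) ? "abstract class" : "class";
        } catch (ClassNotFoundException | LinkageError e) {
            // The type is missing or cannot be linked on this classpath.
            return "absent";
        }
    }

    public static void main(String[] args) {
        System.out.println(SCHEMA_INFO + " -> " + kindOf(SCHEMA_INFO));
    }
}
```

Run it with the same Pulsar client jar that the function runtime uses on the classpath. Without that jar the probe simply reports "absent".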

Devin G. Bost


On Mon, Jul 19, 2021 at 12:41 PM Devin Bost <devin.b...@gmail.com> wrote:

> > I think Sijie is referring to using KubernetesRuntime to deploy functions
> > where each function/source/sink runs as an independent statefulset in K8s.
> > In this scenario, it is possible to have fine grained control over which
> > version of the function container the function is using.
>
> Not everybody is using the KubernetesRuntime yet (especially since the
> Helm charts aren't feature-complete), and it appears that those who aren't
> running KubernetesRuntime would be impacted the most by this issue.
>
> Devin G. Bost
>
>
> On Mon, Jul 19, 2021 at 12:36 PM Devin Bost <devin.b...@gmail.com> wrote:
>
>> > For example, if you are upgrading Flink from one version to the other
>> > version, you have to make a save point in the previous version for all
>> > the Flink jobs.
>> > Upgrade the Flink cluster and resume jobs in a new version.
>> >
>> > https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/ops/upgrading/
>> >
>> > So it is not unreasonable to ask people to do that when dealing
>> > with upgrading a centralized computing engine.
>>
>> One difference with Flink is that organizations running Flink in job mode
>> or application mode can upgrade jobs independently of one another, so teams
>> can upgrade jobs when they are ready without impacting other teams. In the
>> Pulsar case, Pulsar is multi-tenant, so upgrading the entire cluster would
>> break every tenant simultaneously and would block the flow of all messages
>> until all functions are upgraded. If one team takes a year to upgrade their
>> one function, the cluster could not be upgraded until that happened. Also,
>> after all the functions have been upgraded, there would be production
>> downtime while deploying all the upgraded functions, which would be a major
>> outage... It might be possible to write a script to speed up the deployment
>> to shrink the outage window, but there's currently a bug that wipes out
>> existing userConfigs when a function is upgraded, so that adds to the
>> complexity of upgrading all the functions since someone would need to know
>> all the userConfigs for all the functions.
>>
>> So, I don't think we're really comparing the same things here.
>>
>> Devin G. Bost
>>
>>
>> On Mon, Jul 19, 2021 at 12:17 PM Sijie Guo <guosi...@gmail.com> wrote:
>>
>>> On Mon, Jul 19, 2021 at 10:32 AM Jerry Peng <jerry.boyang.p...@gmail.com>
>>> wrote:
>>> >
>>> > I agree that the best we can do right now is to just clearly document
>>> > this as a potential problem when updating 2.7 to 2.8.
>>> >
>>> > We should definitely make every attempt to not make BC breaking changes.
>>> > However, there are times when we have to make these tough decisions for
>>> > one reason or another. The bigger problem I see here is not necessarily
>>> > that a BC breaking change occurred, but rather that we didn't know about
>>> > it beforehand, so we could not clearly document this caveat when 2.8 was
>>> > released. Perhaps this is where we can improve our backwards
>>> > compatibility testing. We already have some but probably not enough, as
>>> > highlighted by this case.
>>> >
>>> > In regards to
>>> >
>>> > > This is partially correct, because you can wait to upgrade the workers
>>> > > pod, but there is no fine grained control over which version of each
>>> > > pod will be running your function, especially in a big cluster with
>>> > > many tenants and functions with this problem
>>> > >
>>> >
>>> >
>>> > I think Sijie is referring to using KubernetesRuntime to deploy
>>> > functions where each function/source/sink runs as an independent
>>> > statefulset in K8s. In this scenario, it is possible to have fine
>>> > grained control over which version of the function container the
>>> > function is using. There currently might not be tools to easily allow
>>> > users to do this but using kubectl one can definitely determine which
>>> > container version is running and potentially update the container
>>> > version on a per function basis.
>>>
>>> Jerry - Thank you! That was what I meant.
>>>
>>> >
>>> > Best,
>>> >
>>> > Jerry
>>> >
>>> > On Mon, Jul 19, 2021 at 12:50 AM Enrico Olivelli <eolive...@gmail.com>
>>> > wrote:
>>> >
>>> > > Sijie,
>>> > > Thank you for your feedback
>>> > > Some additional considerations inline
>>> > >
>>> > > On Mon, Jul 19, 2021, 06:47 Sijie Guo <guosi...@gmail.com> wrote:
>>> > >
>>> > > > I don't think this is a big problem. Because people can recompile
>>> > > > the function and submit the function. Most of the
>>> > > > computing/streaming engines ask users to recompile the jobs and
>>> > > > resubmit the jobs when it upgrades to a new version.
>>> > >
>>> > >
>>> > > Unfortunately this is not easily feasible if the org that is managing
>>> > > the Pulsar service is different from the org who is developing the
>>> > > Functions. And especially it is quite impossible to prevent service
>>> > > interruption.
>>> > >
>>> > > BTW I believe that there is no way to fix this at this point.
>>> > >
>>> > > > The best approach here is to document this behavior.
>>> > > >
>>> > >
>>> > > I agree that the best thing we can do is to document this requirement.
>>> > >
>>> > > Therefore we must ensure in the future that we won't fall again into
>>> > > this kind of issue.
>>> > >
>>> > > Pulsar is becoming more and more used by large enterprises and backward
>>> > > compatibility is a big value.
>>> > >
>>> > > Fortunately not all the Functions need rebuilding.
>>> > >
>>> > >
>>> > >
>>> > >
>>> > > > Also, if you are using Kubernetes runtime to schedule functions, you
>>> > > > are not really impacted.
>>> > > >
>>> > >
>>> > > This is partially correct, because you can wait to upgrade the workers
>>> > > pod, but there is no fine grained control over which version of each
>>> > > pod will be running your function, especially in a big cluster with
>>> > > many tenants and functions with this problem
>>> > >
>>> > >
>>> > > Enrico
>>> > >
>>> > >
>>> > > > - Sijie
>>> > > >
>>> > > > On Fri, Jul 16, 2021 at 2:44 AM Enrico Olivelli <eolive...@gmail.com>
>>> > > > wrote:
>>> > > > >
>>> > > > > Hello,
>>> > > > > I have reported this issue [1] about upgrading from Pulsar 2.7 to
>>> > > > > 2.8. More information is on the ticket, but the short version of
>>> > > > > the story is that in Pulsar 2.8 we introduced a breaking change in
>>> > > > > the Schema API, by switching SchemaInfo from a class to an
>>> > > > > interface.
>>> > > > >
>>> > > > > This leads to an IncompatibleClassChangeError when you have a
>>> > > > > Function or a Connector that is using Schema.JSON(Pojo.class) and
>>> > > > > you upgrade your Pulsar cluster (the functions worker pod for
>>> > > > > instance) from Pulsar 2.7.x to Pulsar 2.8.0.
>>> > > > >
>>> > > > > The bad problem is that you cannot upgrade Pulsar without
>>> > > > > interrupting the service and coordinating with the upgrade of the
>>> > > > > Functions. Your functions need to be recompiled against the Pulsar
>>> > > > > 2.8 API and deployed again in production.
>>> > > > >
>>> > > > > I have tried to move back SchemaInfo to an "abstract class" but
>>> > > > > without success, because then you fall into errors.
>>> > > > >
>>> > > > > I am not sure there is a way to provide a good "upgrade path" for
>>> > > > > Functions/IO users.
>>> > > > >
>>> > > > > If we do not find a way we have to document the upgrade in the
>>> > > > > official Pulsar Documentation.
>>> > > > >
>>> > > > > We must do our best to prevent users from falling again into this
>>> > > > > bad situation.
>>> > > > >
>>> > > > > Any suggestions or thoughts ?
>>> > > > >
>>> > > > > Regards
>>> > > > > Enrico
>>> > > > >
>>> > > > > [1] https://github.com/apache/pulsar/issues/11338
>>> > > >
>>> > >
>>>
>>
