> For example, if you are upgrading Flink from one version to the other
> version, you have to make a save point in the previous version for all
> the Flink jobs.
> Upgrade the Flink cluster and resume jobs in a new version.
>
> https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/ops/upgrading/
>
> So it is not unreasonable for asking people to do that when dealing
> with upgrading a centralized computing engine.

One difference with Flink is that organizations running Flink in job mode
or application mode can upgrade jobs independently of one another, so teams
can upgrade jobs when they are ready without impacting other teams.

In the Pulsar case, Pulsar is multi-tenant, so upgrading the entire cluster
would break every tenant simultaneously and would block the flow of all
messages until all functions are upgraded. If one team takes a year to
upgrade their one function, the cluster could not be upgraded until that
happened.

Also, after all the functions have been upgraded, there would be production
downtime while deploying all the upgraded functions, which would be a major
outage... It might be possible to write a script to speed up the deployment
to shrink the outage window, but there's currently a bug that wipes out
existing userConfigs when a function is upgraded, so that adds to the
complexity of upgrading all the functions since someone would need to know
all the userConfigs for all the functions.

So, I don't think we're really comparing the same things here.

Devin G. Bost

On Mon, Jul 19, 2021 at 12:17 PM Sijie Guo <guosi...@gmail.com> wrote:

> On Mon, Jul 19, 2021 at 10:32 AM Jerry Peng <jerry.boyang.p...@gmail.com>
> wrote:
>
> > I agree that the best we can do right now is to just clearly document
> > this as a potential problem when updating 2.7 to 2.8.
> >
> > We should definitely make every attempt to not make BC breaking changes.
> > However, there are times when we have to make these tough decisions for
> > one reason or another. The bigger problem I see here is not necessarily
> > a BC breaking change occurred, but rather we didn't know about it
> > beforehand so we can clearly document this caveat when 2.8 is released.
> > Perhaps this is where we can improve our backwards compatibility
> > testing. We already have some but probably not enough as highlighted by
> > this case.
> >
> > In regards to
> >
> > > This is partially correct, because you can wait to upgrade the workers
> > > pod, but there is no fine grained control over which version of each
> > > pod will be running your function, especially in a big cluster with
> > > many tenants and functions with this problem
> >
> > I think Sijie is referring to using KubernetesRuntime to deploy functions
> > where each function/source/sink runs as an independent statefulset in
> > K8s. In this scenario, it is possible to have fine grained control over
> > which version of the function container the function is using. There
> > currently might not be tools to easily allow users to do this but using
> > kubectl one can definitely determine which container version is running
> > and potentially update the container version on a per function basis.
>
> Jerry - Thank you! That was what I meant.
>
> > Best,
> >
> > Jerry
> >
> > On Mon, Jul 19, 2021 at 12:50 AM Enrico Olivelli <eolive...@gmail.com>
> > wrote:
> >
> > > Sijie,
> > > Thank you for your feedback
> > > Some additional considerations inline
> > >
> > > On Mon, Jul 19, 2021, 06:47 Sijie Guo <guosi...@gmail.com> wrote:
> > >
> > > > I don't think this is a big problem. Because people can recompile the
> > > > function and submit the function. Most of the computing/streaming
> > > > engines ask users to recompile the jobs and resubmit the jobs when it
> > > > upgrades to a new version.
> > >
> > > Unfortunately this is not easily feasible if the org that is managing
> > > the Pulsar service is different from the org who is developing the
> > > Functions. And especially it is quite impossible to prevent service
> > > interruption.
> > >
> > > BTW I believe that there is no way to fix this at this point.
> > >
> > > > The best approach here is to document this behavior.
> > >
> > > I agree that the best thing we can do is to document this requirement.
> > >
> > > Therefore we must ensure in the future that we won't fall again into
> > > this kind of issues.
> > >
> > > Pulsar is becoming more and more used by large enterprises and backward
> > > compatibility is a big value.
> > >
> > > Fortunately not all the Functions need rebuilding.
> > >
> > > > Also, if you are using Kubernetes runtime to schedule functions, you
> > > > are not really impacted.
> > >
> > > This is partially correct, because you can wait to upgrade the workers
> > > pod, but there is no fine grained control over which version of each
> > > pod will be running your function, especially in a big cluster with
> > > many tenants and functions with this problem
> > >
> > > Enrico
> > >
> > > > - Sijie
> > > >
> > > > On Fri, Jul 16, 2021 at 2:44 AM Enrico Olivelli <eolive...@gmail.com>
> > > > wrote:
> > > >
> > > > > Hello,
> > > > > I have reported this issue [1] about upgrading from Pulsar 2.7 to
> > > > > 2.8. More information is on the ticket, but the short version of
> > > > > the story is that in Pulsar 2.8 we introduced a breaking change in
> > > > > the Schema API, by switching SchemaInfo from a class to an
> > > > > interface.
> > > > >
> > > > > This leads to an IncompatibleClassChangeError when you have a
> > > > > Function or a Connector that is using Schema.JSON(Pojo.class) and
> > > > > you upgrade your Pulsar cluster (the functions worker pod for
> > > > > instance) from Pulsar 2.7.x to Pulsar 2.8.0.
> > > > >
> > > > > The bad problem is that you cannot upgrade Pulsar without
> > > > > interrupting the service and coordinating with the upgrade of the
> > > > > Functions. Your functions need to be recompiled against the Pulsar
> > > > > 2.8 API and deployed again in production.
> > > > >
> > > > > I have tried to move back SchemaInfo to an "abstract class" but
> > > > > without success, because then you fall into errors.
> > > > >
> > > > > I am not sure there is a way to provide a good "upgrade path" for
> > > > > Functions/IO users.
> > > > >
> > > > > If we do not find a way we have to document the upgrade in the
> > > > > official Pulsar Documentation.
> > > > >
> > > > > We must do our best to prevent users from falling again into this
> > > > > bad situation.
> > > > >
> > > > > Any suggestions or thoughts ?
> > > > >
> > > > > Regards
> > > > > Enrico
> > > > >
> > > > > [1] https://github.com/apache/pulsar/issues/11338